Abstract

The past century has seen a steady increase in the need to estimate and predict
complex systems and to make (possibly critical) decisions with limited information.
Although computers have made possible the numerical evaluation of
sophisticated statistical models, these models are still designed by humans because
there is currently no known recipe or algorithm for dividing the design of a statistical
model into a sequence of arithmetic operations. Indeed enabling computers to think
as humans have the ability to do when faced with uncertainty is challenging in several
major ways: (1) Finding optimal statistical models remains to be formulated as
a well posed problem when information on the system of interest is incomplete and
comes in the form of a complex combination of sample data, partial knowledge of
constitutive relations and a limited description of the distribution of input random
variables. (2) The space of admissible scenarios along with the space of relevant
information, assumptions, and/or beliefs, tends to be infinite dimensional, whereas
calculus on a computer is necessarily discrete and finite. With these challenges in mind, this
paper explores the foundations of a rigorous framework for the scientific computation
of optimal statistical estimators/models and reviews their connections with
Decision Theory, Machine Learning, Bayesian Inference, Stochastic Optimization,
Robust Optimization, Optimal Uncertainty Quantification and Information Based
Complexity.


1 Introduction 

During the past century the need to solve large complex problems in applications such as
fluid dynamics, neutron transport or ballistic prediction drove the parallel development
of computers and numerical methods for solving ODEs and PDEs. It is now clear that
this development led to a paradigm shift. Before: each new PDE required the development
of new theoretical methods and the employment of large teams of mathematicians
and physicists; in most cases, information on solutions was only qualitative and based on
general analytical bounds on fundamental solutions. After: mathematical analysis and
computer science worked in synergy to give birth to robust numerical methods (such as
finite element methods) capable of solving a large spectrum of PDEs without requiring
the level of expertise of an A. L. Cauchy or the level of insight of an R. P. Feynman. This
transformation can be traced back to sophisticated calculations performed by arrays of
human computers organized as parallel clusters, such as in the pioneering work of Lewis
Fry Richardson [121, 89], who in 1922 had a room full of clerks attempt to solve
finite-difference equations for the purposes of weather forecasting, and the 1947 paper by John
Von Neumann and Herman Goldstine on Numerical Inverting of Matrices of High Order
[154]. Although Richardson’s predictions failed due to the use of unfiltered data/initial
conditions/equations and large time-steps not satisfying the CFL stability condition [89],
his vision was shared by Von Neumann [89] in his proposal of the Meteorology Research
Project to the U.S. Navy in 1946, qualified by Platzman [119] as “perhaps the most
visionary prospectus for numerical weather prediction since the publication of Richardson’s
book a quarter-century earlier.”

The past century has also seen a steady increase in the need to estimate and predict
complex systems and to make (possibly critical) decisions with limited information.
Although computers have made possible the numerical evaluation of sophisticated
statistical models, these models are still designed by humans through the employment of
multi-disciplinary teams of physicists, computer scientists and statisticians. Contrary
to the original human computers (such as the ones pioneered by L. F. Richardson or
overseen by R. P. Feynman at Los Alamos), these human teams do not follow a specific
algorithm (such as the one envisioned in Richardson’s Forecast Factory where 64,000
human computers would have been working in parallel and at high speed to compute
world weather charts [89]) because there is currently no known recipe or algorithm for
dividing the design of a statistical model into a sequence of arithmetic operations.
Furthermore, while human computers were given a specific PDE or ODE to solve, these
human teams are not given a well posed problem with a well defined notion of solution.
As a consequence different human teams come up with different solutions to the design
of the statistical model along with different estimates on uncertainties.

Indeed enabling computers to think as humans have the ability to do when faced
with uncertainty is challenging in several major ways: (1) There is currently no known
recipe or algorithm for dividing the design of a statistical model into a sequence of
arithmetic operations. (2) Formulating the search for an optimal statistical estimator/model
as a well posed problem is not obvious when information on the system of interest is
incomplete and comes in the form of a complex combination of sample data, partial
knowledge of constitutive relations and a limited description of the distribution of input
random variables. (3) The space of admissible scenarios along with the space of relevant
information, assumptions, and/or beliefs, tends to be infinite dimensional, whereas
calculus on a computer is necessarily discrete and finite.

The purpose of this paper is to explore the foundations of a rigorous/rational framework
for the scientific computation of optimal statistical estimators/models for complex
systems and review their connections with Decision Theory, Machine Learning, Bayesian
Inference, Stochastic Optimization, Robust Optimization, Optimal Uncertainty Quantification
and Information Based Complexity, the most fundamental of these being the
simultaneous emphasis on computation and performance as in Machine Learning initiated
by Valiant [149].

2 The UQ problem without sample data

2.1 Cebysev, Markov and Krein

Let us start with a simple warm-up problem.

Problem 1. Let $\mathcal{A}$ be the set of probability measures on $[0,1]$ having mean at most
$m \in (0,1)$. Let $\mu^\dagger$ be an unknown element of $\mathcal{A}$ and let $a \in (m,1)$. What is $\mu^\dagger[X \geq a]$?

Observe that given the limited information on $\mu^\dagger$, $\mu^\dagger[X \geq a]$ could a priori be any
number in the interval $[\mathcal{L}(\mathcal{A}), \mathcal{U}(\mathcal{A})]$ obtained by computing the sup (inf) of $\mu[X \geq a]$
with respect to all possible candidates for $\mu^\dagger$, i.e.

$$\mathcal{U}(\mathcal{A}) := \sup_{\mu \in \mathcal{A}} \mu[X \geq a] \tag{1}$$

and

$$\mathcal{L}(\mathcal{A}) := \inf_{\mu \in \mathcal{A}} \mu[X \geq a]$$

where

$$\mathcal{A} := \{\mu \in \mathcal{M}([0,1]) \mid \mathbb{E}_\mu[X] \leq m\}$$

and $\mathcal{M}([0,1])$ is the set of Borel probability measures on $[0,1]$. It is easy to observe
that the extremum of (1) can be achieved only when $\mu$ is the weighted sum of a Dirac
mass at 0 and a Dirac mass at $a$. It follows that, although (1) is an infinite dimensional
optimization problem, it can be reduced to the simple one-dimensional optimization
problem obtained by letting $p \in [0,1]$ denote the weight of the Dirac mass at $a$ and $1 - p$
the weight of the Dirac mass at 0: Maximize $p$ subject to $ap = m$, producing the Markov
bound $m/a$ as solution.
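
As a hedged numerical check of this reduction, the following Python sketch compares the value $m/a$ of the one-dimensional problem with the tail mass of randomly generated admissible discrete measures. The function names, the random construction of admissible measures, and the values $m = 0.2$, $a = 0.5$ are illustrative assumptions, not part of the original argument.

```python
import random


def markov_bound(m, a):
    """Value of the reduced problem: maximize p subject to a * p <= m, 0 <= p <= 1."""
    return min(1.0, m / a)


def random_admissible_measure(m, n_atoms, rng):
    """Random discrete measure on [0, 1]; atoms are shrunk toward 0 if the mean exceeds m."""
    atoms = [rng.random() for _ in range(n_atoms)]
    raw = [rng.random() for _ in range(n_atoms)]
    total = sum(raw)
    weights = [w / total for w in raw]
    mean = sum(w * x for w, x in zip(weights, atoms))
    if mean > m:
        atoms = [x * m / mean for x in atoms]  # rescaled mean equals m
    return atoms, weights


def tail_mass(atoms, weights, a):
    """Mass that the discrete measure assigns to [a, 1]."""
    return sum(w for x, w in zip(atoms, weights) if x >= a)


m, a = 0.2, 0.5  # illustrative values with a in (m, 1)
rng = random.Random(0)

largest_random_tail = 0.0
for _ in range(10000):
    atoms, weights = random_admissible_measure(m, 5, rng)
    largest_random_tail = max(largest_random_tail, tail_mass(atoms, weights, a))

# Extremal measure from the text: Dirac masses at 0 and a with weights 1 - m/a and m/a.
extremal_tail = tail_mass([0.0, a], [1.0 - m / a, m / a], a)
```

In this sketch no sampled measure exceeds the bound, while the two-point measure attains it; random sampling is only a sanity check and not a proof of optimality.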

Problems such as (1) can be traced back to Cebysev [76, Pg. 4]: “Given: length,
weight, position of the centroid and moment of inertia of a material rod with a density
varying from point to point. It is required to find the most accurate limits for the weight
of a certain segment of this rod.” According to Krein [76], although Cebysev did solve
this problem, it was his student Markov who supplied the proof in his thesis. See Krein
[76] for an account of the history of this subject along with substantial contributions by
Krein.

2.2 Optimal Uncertainty Quantification 

The generalization of the process described in Subsection 2.1 to complex systems involving
imperfectly known functions and measures is the point of view of Optimal Uncertainty
Quantification (OUQ) [113, 95, 71, 2, 142, 68]. Instead of developing sophisticated
mathematical solutions, the OUQ approach is to develop optimization problems and
reductions, so that their solution may be implemented on a computer, as in Bertsimas and
Popescu’s [14] convex optimization approach to Cebysev inequalities, and the Decision
Analysis framework of Smith [133].

To present this generalization, for a topological space $\mathcal{X}$, let $\mathcal{F}(\mathcal{X})$ be the space of
real-valued measurable functions and $\mathcal{M}(\mathcal{X})$ be the set of Borel probability measures on
$\mathcal{X}$. Let $\mathcal{A}$ be an arbitrary subset of $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$, and let $\Phi \colon \mathcal{A} \to \mathbb{R}$ be a function
producing a quantity of interest.

Problem 2. Let $(f^\dagger, \mu^\dagger)$ be an unknown element of $\mathcal{A}$. What is $\Phi(f^\dagger, \mu^\dagger)$?

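Therefore, in absence of sample data, in the context of this generalization one is
interested in estimating $\Phi(f^\dagger, \mu^\dagger)$, where $(f^\dagger, \mu^\dagger) \in \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ corresponds to an
unknown reality: the function $f^\dagger$ represents a response function of interest, and $\mu^\dagger$
represents the probability distribution of the inputs of $f^\dagger$. If $\mathcal{A}$ represents all that is
known about $(f^\dagger, \mu^\dagger)$ (in the sense that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$ and that any $(f, \mu) \in \mathcal{A}$ could, a
priori, be $(f^\dagger, \mu^\dagger)$ given the available information) then [113] shows that the quantities

$$\mathcal{U}(\mathcal{A}) := \sup_{(f,\mu) \in \mathcal{A}} \Phi(f, \mu) \tag{2}$$

$$\mathcal{L}(\mathcal{A}) := \inf_{(f,\mu) \in \mathcal{A}} \Phi(f, \mu) \tag{3}$$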

determine the inequality

$$\mathcal{L}(\mathcal{A}) \leq \Phi(f^\dagger, \mu^\dagger) \leq \mathcal{U}(\mathcal{A}), \tag{4}$$

to be optimal given the available information $(f^\dagger, \mu^\dagger) \in \mathcal{A}$ as follows: It is simple to see
that the inequality (4) follows from the assumption that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$. Moreover, for any
$\epsilon > 0$ there exists a $(f, \mu) \in \mathcal{A}$ such that

$$\mathcal{U}(\mathcal{A}) - \epsilon < \Phi(f, \mu) \leq \mathcal{U}(\mathcal{A}).$$

Consequently since all that is known about $(f^\dagger, \mu^\dagger)$ is that $(f^\dagger, \mu^\dagger) \in \mathcal{A}$, it follows that
the upper bound $\Phi(f^\dagger, \mu^\dagger) \leq \mathcal{U}(\mathcal{A})$ is the best obtainable given that information, and
the lower bound is optimal in the same sense.

Although the OUQ optimization problems (2) and (3) are extremely large and although
some are computationally intractable, an important subclass enjoys significant
and practical finite-dimensional reduction properties [113]. First, by [113, Cor. 4.4],
although the optimization variables $(f, \mu)$ lie in a product space of functions and probability
measures, for OUQ problems governed by linear inequality constraints on generalized
moments, the search can be reduced to one over probability measures that are products
of finite convex combinations of Dirac masses with explicit upper bounds on the number
of Dirac masses.

Furthermore, in the special case that all constraints are generalized moments of functions
of $f$, the dependency on the coordinate positions of the Dirac masses is eliminated
by observing that the search over admissible functions reduces to a search over functions
on an $m$-fold product of finite discrete spaces, and the search over $m$-fold products of
finite convex combinations of Dirac masses reduces to a search over the products of probability
measures on this $m$-fold product of finite discrete spaces [113, Thm. 4.7]. Finally,
by [113, Thm. 4.9], using the lattice structure of the space of functions, the search over
these functions can be reduced to a search over a finite set.
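
To make the idea of a reduction to Dirac masses concrete, the following Python sketch revisits Problem 1, where Subsection 2.1 notes that the extremum is attained on two Dirac masses. It is a brute-force search over two-point measures supported on a grid, offered only as a hedged illustration: the function name, the grid, and the values $m = 0.2$, $a = 0.5$ are assumptions, and the search is exact only when $a$ lies on the grid. It is not the general algorithm of [113].

```python
def best_two_point_measure(m, a, grid_size=51):
    """Maximize mu[X >= a] over measures (1 - w) * delta_{x0} + w * delta_{x1}
    supported on a uniform grid of [0, 1], subject to the mean constraint
    (1 - w) * x0 + w * x1 <= m. Assumes 0 < m < a < 1."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    best_value, best_measure = 0.0, None
    for x0 in grid:
        if x0 > m:
            continue  # lower atom above m: mean constraint cannot hold
        for x1 in grid:
            if x1 < a:
                continue  # upper atom below a: no mass on [a, 1]
            # Largest weight on x1 compatible with the mean constraint (x1 > m >= x0).
            w = (m - x0) / (x1 - x0)
            if w > best_value:
                best_value = w
                best_measure = ((x0, 1.0 - w), (x1, w))
    return best_value, best_measure


m, a = 0.2, 0.5  # illustrative values
value, measure = best_two_point_measure(m, a)
```

With these values the search returns Dirac masses at 0 and at $a$, with weight $m/a$ on the latter, in agreement with the Markov bound of Subsection 2.1.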

Fundamental to this development is Winkler’s [169] generalization of the characterization
of the extreme points of compact (in the weak topology) sets of probability
measures constrained by a finite number of generalized moment inequalities defined by
continuous functions to non-compact sets of tight measures, in particular probability
measures on Borel subsets of Polish metric spaces, defined by Borel measurable moment
functions, along with his [168] development of Choquet theory for weakly closed convex
non-compact sets of tight measures. These results are based on Kendall’s [70] equivalence
between a linearly compact Choquet simplex and a vector lattice and results of
Dubins [30] concerning the extreme points of affinely constrained convex sets in terms
of the extreme points of the unconstrained set. It is interesting to note that Winkler
[169] uses Kendall’s result to derive a strong sharpening of Dubins’ result [30]. Winkler’s
results allow the extension of existing optimization results over measures on compact
metric spaces constrained by continuous generalized moment functions to optimization
over measures on Borel subsets of Polish spaces constrained by Borel measurable moment
functions. For systems with symmetry, the Choquet theorem of Varadarajan [151] can
be used to show that the Dirac masses can be replaced by the ergodic measures in these
results. The inclusion of sets of functions along with sets of measures in the optimization
problems facilitates the application to systems with imprecisely known response
functions. In particular, a result of Ressel [120], providing conditions under which the
map $(f, \mu) \mapsto f_* \mu$ from function/measure pairs to the induced law is Borel measurable,
facilitates the extension of these techniques from sets of measures to sets of random
variables. In general, the inclusion of functions in the domain of optimization requires
the development of generalized programming techniques such as generalized Benders
decompositions described in Geoffrion [45]. Moreover, as has been so successful in Machine
Learning, it will be convenient to approximate the space of measurable functions $\mathcal{F}(\mathcal{X})$
by some reproducing kernel Hilbert space $\mathcal{H}(\mathcal{X}) \subset \mathcal{F}(\mathcal{X})$ producing an approximation
$\mathcal{H}(\mathcal{X}) \times \mathcal{M}(\mathcal{X}) \subset \mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ to the full base space. Under the mild assumption
that $\mathcal{X}$ is an analytic subset of a Polish space and $\mathcal{H}(\mathcal{X})$ possesses a measurable feature
map, it has recently been shown in [109] that $\mathcal{H}(\mathcal{X})$ is separable. Consequently, since
all separable Hilbert spaces are isomorphic with $\ell^2$, it follows that the space $\ell^2 \times \mathcal{M}(\mathcal{X})$
is a universal representation space for the approximation of $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$. Moreover,
in that case, since $\mathcal{X}$ is separable and metric so is $\mathcal{M}(\mathcal{X})$ in the weak topology, and since
$\mathcal{H}(\mathcal{X})$ is Polish, it follows that the approximation space $\mathcal{H}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ is the product
of a Polish space and a separable metric space. When furthermore $\mathcal{X}$ is Polish it follows
that the approximation space is the product of Polish spaces and therefore Polish.

Example 1. A classic example is $\Phi(f, \mu) := \mu[f \geq a]$ where $a$ is a safety margin.
In the certification context one is interested in showing that $\mu^\dagger[f^\dagger \geq a] \leq \epsilon$, where
$\epsilon$ is a safety certification threshold (i.e. the maximum acceptable $\mu^\dagger$-probability of the
system $f^\dagger$ exceeding the safety margin $a$). If $\mathcal{U}(\mathcal{A}) \leq \epsilon$, then the system associated with
$(f^\dagger, \mu^\dagger)$ is safe even in the worst case scenario (given the information represented by $\mathcal{A}$).
If $\mathcal{L}(\mathcal{A}) > \epsilon$, then the system associated with $(f^\dagger, \mu^\dagger)$ is unsafe even in the best case
scenario (given the information represented by $\mathcal{A}$). If $\mathcal{L}(\mathcal{A}) \leq \epsilon < \mathcal{U}(\mathcal{A})$, then the safety
of the system cannot be decided (although one could declare the system to be unsafe
due to lack of information).
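
The three-way certification rule of Example 1 can be sketched in Python as follows. The function name `certify`, its argument names, and the string labels are hypothetical choices made for illustration; the optimal bounds are assumed to have been computed elsewhere.

```python
def certify(lower, upper, epsilon):
    """Three-way decision of Example 1.

    lower, upper: the optimal bounds L(A) and U(A) on the failure probability.
    epsilon: the safety certification threshold.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if upper <= epsilon:
        return "safe"       # safe even in the worst case scenario
    if lower > epsilon:
        return "unsafe"     # unsafe even in the best case scenario
    return "undecided"      # information in A is insufficient to decide


# Illustrative call: the bounds straddle the threshold, so safety is undecided.
decision = certify(lower=0.01, upper=0.20, epsilon=0.05)
```

A conservative policy, as the example notes, could map the undecided outcome to unsafe.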

2.3 Worst case analysis 

The proposed solutions to Problems 1 and 2 are particular instances of worst case analysis
that, as noted by [135] and [127, p. 5], is an old concept that could be summarized by
the popular adage “When in doubt, assume the worst!” or

The gods to-day stand friendly, that we may,
Lovers of peace, lead on our days to age
But, since the affairs of men rests still uncertain,
Let’s reason with the worst that may befall.

Julius Caesar, Act 5, Scene 1
William Shakespeare (1564-1616)

As noted in [113], an example of worst case analysis in seismic engineering is that of
Drenick’s critical excitation [29] which seeks to quantify the safety of a structure to
the worst earthquake given a constraint on its magnitude. The combination of structural
optimization (in various fields of engineering) to produce an optimal design given
the (deterministic) worst-case scenario has been referred to as Optimization and
Anti-Optimization [34]. The main difference between OUQ and Anti-optimization lies in the
fact that the former is based on an optimization over (admissible) functions and measures
$(f, \mu)$, while the latter only involves an optimization over $f$. Because of its robustness,
many engineers have adopted the (deterministic) worst-case scenario approach to UQ
[34, Chap. 10] when a high reliability is required.

2.4 Stochastic and Robust Optimization 

Robust Control [176] and Robust Optimization [6, 13] have been founded upon the worst
case approach to uncertainty. Recall that Robust Optimization describes optimization
involving uncertain parameters. While these uncertain parameters are modeled as random
variables (of known distribution) in Stochastic Programming [25], Robust Optimization
only assumes that they are contained in known (ambiguity) sets. Although,
as noted in [34], probabilistic methods do not find appreciation among theoreticians and
practitioners alike because “probabilistic reliability studies involve assumptions on the
probability densities, whose knowledge regarding relevant input quantities is central”,
the deterministic worst case approach (limited to optimization problems over $f$) is sometimes
“too pessimistic to be practical” [29, 34] because “it does not take into account
the improbability that (possibly independent or weakly correlated) random variables conspire
to produce a failure event” [113] (which constitutes one motivation for considering
ambiguity sets involving both measures and functions). Therefore OUQ and Distributionally
Robust Optimization (DRO) [6, 48, 13, 174, 177, 166, 52] could be seen as a
middle ground between the deterministic worst case approach of Robust Optimization
[6, 13] and approaches of Stochastic Programming and Chance Constrained Optimization
[18, 23] by defining optimization objectives and constraints in terms of expected
values and probabilities with respect to imperfectly known distributions.
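
The contrast between the three viewpoints can be sketched on a toy problem in Python. Everything in the sketch is an illustrative assumption: the quadratic cost, the three scenarios, the nominal distribution, and the finite list of candidate distributions standing in for an ambiguity set. It is not taken from the cited references.

```python
def cost(x, xi):
    """Toy cost of decision x when the uncertain parameter takes the value xi."""
    return (x - xi) ** 2


def stochastic_value(x, scenarios, probs):
    """Stochastic Programming: expected cost under a known distribution."""
    return sum(p * cost(x, xi) for xi, p in zip(scenarios, probs))


def robust_value(x, scenarios):
    """Robust Optimization: worst cost over the set of possible parameter values."""
    return max(cost(x, xi) for xi in scenarios)


def dro_value(x, scenarios, candidate_probs):
    """DRO: worst expected cost over a set of candidate distributions."""
    return max(stochastic_value(x, scenarios, p) for p in candidate_probs)


def minimize(objective, decisions):
    """Brute-force minimization over a finite list of decisions."""
    best = min(decisions, key=objective)
    return best, objective(best)


scenarios = [0.0, 0.5, 1.0]
nominal = [0.6, 0.3, 0.1]
ambiguity = [nominal, [0.4, 0.3, 0.3], [0.7, 0.2, 0.1]]  # contains the nominal law
decisions = [i / 100 for i in range(101)]

x_sp, v_sp = minimize(lambda x: stochastic_value(x, scenarios, nominal), decisions)
x_dro, v_dro = minimize(lambda x: dro_value(x, scenarios, ambiguity), decisions)
x_ro, v_ro = minimize(lambda x: robust_value(x, scenarios), decisions)
```

Because the nominal distribution belongs to the candidate set, the optimal values are ordered as stochastic, then distributionally robust, then robust, which is the sense in which DRO sits between the other two approaches in this toy setting.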

Although Stochastic Optimization has different objectives than OUQ and DRO,
many of its optimization results, such as found in Birge and Wets [15], Ermoliev [35]
and Gaivoronski [43], are useful. In particular, the well-developed subject of Edmundson
and Madansky bounds, such as Edmundson [33], Madansky [90, 91], Gassman and
Ziemba [44], Huang, Ziemba and Ben-Tal [56], Frauendorfer [40], Ben-Tal and Hochman
[7], Huang, Vertinsky and Ziemba [55], and Kall [66], provides powerful results. Recently
Hanasusanto, Roitch, Kuhn and Wiesemann [52] derived explicit conic reformulations for
tractable problem classes and suggested efficiently computable conservative approximations
for intractable ones. In some cases, e.g. Bertsimas and Popescu [14] and Han et
al. [51], DRO/OUQ optimization problems can be reduced to convex optimization.

2.5 Cebysev inequalities and optimization theory 

As noted in [113], inequalities (4) can be seen as a generalization of Cebysev inequalities.
The history of classical inequalities can be found in [69], and some generalizations in
[14] and [150]; in the latter works, the connection between Cebysev inequalities and
optimization theory is developed based on the work of Mulholland and Rogers [97],
Godwin [47], Isii [59, 60, 61], Olkin and Pratt [105], Marshall and Olkin [93], and the
classical Markov-Krein Theorem [69, pages 82 and 157], among others. We also refer
to the field of majorization, as discussed in Marshall and Olkin [94], the inequalities
of Anderson [4], Hoeffding [53], Joe [63], Bentkus et al. [11], Bentkus [9, 10], Pinelis
[117, 118], and Boucheron, Lugosi and Massart [19]. Moreover, the solution of the
resulting nonconvex optimization problems benefits from duality theories for nonconvex
optimization problems such as Rockafellar [123] and the development of convex envelopes
for them, as can be found, for example, in Rikun [122] and Sherali [131].

3 The UQ problem with sample data 

3.1 From Game Theory to Decision Theory 

To motivate the general formulation in the presence of sample data, consider another 
simple warm-up problem. 

Problem 3. Let $\mathcal{A}$ be the set of probability measures on $[0,1]$ having mean at most
$m \in (0,1)$. Let $\mu^\dagger$ be an unknown element of $\mathcal{A}$ and let $a \in (m,1)$. You observe
$d := (d_1, \ldots, d_n)$, $n$ i.i.d. samples from $\mu^\dagger$. What is the sharpest estimate of $\mu^\dagger[X \geq a]$?

The only difference between Problems 3 and 1 lies in the availability of data sampled
from the underlying unknown distribution. Observe that, in presence of this sample data,
the notions of sharpest estimate or smallest interval of confidence are far from being
transparent and call for clear and precise definitions. Note also that if the constraint
$\mathbb{E}_{\mu^\dagger}[X] \leq m$ is ignored, and the number $n$ of sample data is large, then one could use the
Central Limit Theorem or a concentration inequality (such as Hoeffding’s inequality)
to derive an interval of confidence for $\mu^\dagger[X \geq a]$. A non-trivial question of practical
importance is what to do when $n$ is not large.
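
For concreteness, the Hoeffding route mentioned above can be sketched in Python. The sketch applies the standard two-sided Hoeffding inequality to the indicator variables of the event $X \geq a$ and ignores the mean constraint; the function name, the confidence parameter `delta`, and the sample values are illustrative assumptions.

```python
import math


def hoeffding_interval(samples, a, delta):
    """Two-sided confidence interval of level 1 - delta for mu[X >= a].

    Uses Hoeffding's inequality for the [0, 1]-valued indicators 1{x >= a}:
    P(|frequency - mu[X >= a]| >= t) <= 2 * exp(-2 * n * t**2).
    """
    n = len(samples)
    frequency = sum(1 for x in samples if x >= a) / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, frequency - half_width), min(1.0, frequency + half_width)


# Illustrative data: 20 of 100 samples fall in [a, 1].
large_sample = [0.1] * 80 + [0.9] * 20
low, high = hoeffding_interval(large_sample, a=0.5, delta=0.05)

# With only 5 samples the interval is too wide to be informative.
small_low, small_high = hoeffding_interval([0.1, 0.1, 0.1, 0.1, 0.9], a=0.5, delta=0.05)
```

The second call illustrates the difficulty raised in the text: for small $n$ the half-width exceeds one half, so the interval carries almost no information.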

Writing $\Phi(\mu^\dagger) := \mu^\dagger[X \geq a]$ as the quantity of interest, observe that an estimation
of $\Phi(\mu^\dagger)$ is a function (which will be written $\theta$) of the data $d$. Ideally one would like to
choose $\theta$ so that the estimation error $\theta(d) - \Phi(\mu^\dagger)$ is as close as possible to zero. Since $d$
is random, a more robust notion of error is that of a statistical error $\mathcal{E}(\theta, \mu^\dagger)$ defined by
weighting the error with respect to a real measurable positive loss function $V \colon \mathbb{R} \to \mathbb{R}$
and the distribution of the data, i.e.

$$\mathcal{E}(\theta, \mu^\dagger) := \mathbb{E}_{d \sim (\mu^\dagger)^n}\big[V\big(\theta(d) - \Phi(\mu^\dagger)\big)\big] \tag{5}$$

Note that for $V(x) = x^2$ the statistical error $\mathcal{E}(\theta, \mu^\dagger)$ defined in (5) is the mean
squared error with respect to the distribution of the data $d$ of the estimation error.
For $V(x) = \mathbb{1}_{[\gamma, \infty)}(|x|)$ defined for some $\gamma > 0$, $\mathcal{E}(\theta, \mu^\dagger)$ is the probability with respect
to the distribution of $d$ that the estimation error is larger than $\gamma$.

Now since $\mu^\dagger$ is unknown, the statistical error $\mathcal{E}(\theta, \mu^\dagger)$ of any $\theta$ is also unknown.
However one can still identify the least upper bound on that statistical error through a
worst case scenario with respect to all possible candidates for $\mu^\dagger$, i.e.

$$\sup_{\mu \in \mathcal{A}} \mathcal{E}(\theta, \mu). \tag{6}$$

The sharpest estimator (possibly within a given class) is then naturally obtained as the
minimizer of (6) over all functions $\theta$ of the data $d$ within that class, i.e. as the minimizer
of

$$\inf_{\theta} \sup_{\mu \in \mathcal{A}} \mathcal{E}(\theta, \mu). \tag{7}$$


Observe that the optimal estimator is identified independently from the
observation/realization of the data and if the minimum of (7) is not achieved then one can still
use a near-optimal $\theta$. Then, when the data is observed, the estimate of the quantity of
interest $\Phi(\mu^\dagger)$ is derived by evaluating the near-optimal $\theta$ on the data $d$.
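
The worst-case statistical error and the resulting choice of estimator can be sketched in Python on a deliberately simplified version of Problem 3. The sketch assumes, purely for illustration, that the candidate measures are restricted to two-point measures with weight $q \leq m/a$ on $a$ and $1 - q$ on 0, so that the quantity of interest equals $q$ and the data reduce to the count $k$ of samples equal to $a$. The loss is $V(x) = x^2$, and the two estimators compared, as well as all function names and parameter values, are hypothetical choices.

```python
import math


def binomial_pmf(k, n, q):
    """Probability of k successes among n i.i.d. draws with success probability q."""
    return math.comb(n, k) * q**k * (1.0 - q) ** (n - k)


def statistical_error(theta, q, n, loss):
    """Expected loss of theta(k, n) - q when k is Binomial(n, q), in the spirit of (5)."""
    return sum(binomial_pmf(k, n, q) * loss(theta(k, n) - q) for k in range(n + 1))


def worst_case_error(theta, candidates, n, loss):
    """Largest statistical error over the candidate measures, in the spirit of (6)."""
    return max(statistical_error(theta, q, n, loss) for q in candidates)


m, a, n = 0.2, 0.5, 5           # illustrative values
q_max = m / a                   # mean constraint a * q <= m
candidates = [q_max * i / 40 for i in range(41)]


def squared_loss(x):
    return x * x


def empirical(k, n):
    """Empirical frequency of samples equal to a."""
    return k / n


def clipped(k, n):
    """Empirical frequency truncated at the largest admissible value m / a."""
    return min(k / n, q_max)


estimators = {"empirical": empirical, "clipped": clipped}
worst = {name: worst_case_error(est, candidates, n, squared_loss)
         for name, est in estimators.items()}

# Restriction of (7) to a class of two estimators: keep the smaller worst-case error.
best_name = min(worst, key=worst.get)
```

Note that, as in the text, the estimator is selected before any data are observed; the data only enter when the selected estimator is evaluated.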

The notion of optimality described here is that of Wald’s Statistical Decision Theory
[156, 157, 158, 160, 161], evidently influenced by Von Neumann’s Game Theory [153, 155].
In Wald’s formulation [157], which cites both Von Neumann [153] and Von Neumann and
Morgenstern [155], the statistician finds himself in an adversarial game played against
the Universe in which he tries to minimize a risk function $\mathcal{E}(\theta, \mu)$ with respect to $\theta$ in a
worst case scenario with respect to what the Universe’s choice of $\mu$ could be.


3.2 The optimization approach to statistics 

The optimization approach to statistics is not new and this section will now give a
short, albeit incomplete, description of its development, primarily using Lehmann’s
account [86]. Accordingly, it began with Gauss and Laplace with the non-parametric result
referred to as the Gauss-Markov theorem, asserting that the least squares estimates are
the linear unbiased estimates with minimum variance. Then, in Fisher’s fundamental
paper [38], for parametric models he proposes the maximum likelihood estimator and
claims (but does not prove) that such estimators are consistent and asymptotically
efficient. According to Lehmann, “the situation is complex, but under suitable restrictions
Fisher’s conjecture is essentially correct ...” Fisher’s maximum likelihood principle was
first proposed on intuitive grounds and then its optimality properties developed.
However, according to Lehmann [85, Pg. 1011], Pearson then asked Neyman “Why these
tests rather than any of the many other that could be proposed?” This question resulted
in Neyman and Pearson’s 1928 papers [103] on the likelihood ratio method, which gives
the same answer as Fisher’s tests under normality assumptions. However, Neyman was
not satisfied. He agreed that the likelihood ratio principle was appealing but felt that
it was lacking a logically convincing justification. This then led to the publication of
Neyman and Pearson [104], containing their now famous Neyman-Pearson Lemma, of which
Lehmann [86] writes, “In a certain sense this is the true start of optimality
theory.” In a major extension of the Neyman-Pearson work, Huber [57] proves a robust
version of the Neyman-Pearson Lemma, in particular, providing an optimality criterion
defining the robust estimator, giving rise to a rigorous theory of robust statistics based
on optimality, see Huber’s Wald Lecture [58]. Robustness is particularly suited to the
Wald framework since robustness considerations are easily formulated with the proper
choices of admissible functions and measures in the Wald framework. Another example
is Kiefer’s introduction of optimality into experimental design, resulting in Kiefer’s 40
papers on Optimum Experimental Designs [73].

Not everyone was happy with “optimality” as a guiding principle. For example 
Lehmann [86] states that at a 1958 meeting of the Royal Statistical Society at which 
Kiefer presented a survey talk [72] on Optimum Experimental Designs, Barnard quotes 
Kiefer as saying of procedures proposed in a paper by Box and Wilson that they “often 
[are] not even well-defined rules of operation.” Barnard’s reply: 

“in the field of practical human activity, rules of operation which are not 
well-defined may be preferable to rules which are.” 

Wynn [173], in his introduction to a reprint of Kiefer’s paper, calls this “a clash of 
statistical cultures.” Indeed, it is interesting to read the generally negative responses 
to Kiefer’s article [72] and the remarkable rebuttal by Kiefer therein. Tukey had other 
criticisms regarding “The tyranny of the best” in [147] and “The dangers of optimization” 
in [148]. In the latter he writes: 

“Some [statisticians] seem to equate [optimization] to statistics - an attitude
which, if widely adopted, is guaranteed to produce a dried-up, encysted field
with little chance of real growth.”

For an account of how the Exploratory Data Analysis approach of Tukey fits within the
Fisher/Neyman-Pearson debate, see Lenhard [87].

Let us also remark on the influence that Student (William Sealy Gosset) had on
both Fisher and Pearson. As presented in Lehmann’s [84] ““Student” and small-sample
theory”, quoting F. N. David [78]: “I think he [Gosset] was really the big influence
in statistics... He asked the questions and Pearson or Fisher put them into statistical
language and then Neyman came to work with the mathematics. But I think most
of it stems from Gosset.” The aim of Lehmann’s paper [84] is to consider to what
extent David’s conclusion is justified. Indeed, the claim is surprising since Gosset is
mainly known for only one contribution, that is Student [141], with the introduction of
Student’s t-test and its analysis under the normal distribution. According to Lehmann
“Today the pathbreaking nature of this paper is generally recognized and has been
widely commented upon, ...”. Gosset’s primary concern in communicating with both
Fisher and Pearson was the robustness of the test to non-normality. Lehmann concludes
that “the main ideas leading to Pearson’s research were indeed provided by Student.”
See Lehmann [84] for the full account, including Gosset’s relationship to the
Fisher-Pearson debate, Pearson [115] for a statistical biography of Gosset, and Fisher [39] for
a eulogy. Consequently, modern statistics appears to owe a lot to Gosset. Moreover,
the reason for the pseudonym was a policy by Gosset’s employer, the brewery Arthur
Guinness Sons and Co., against work done for the firm being made public. Allowing
Gosset to publish under a pseudonym was a concession that resulted in the birth of the
statistician Student. Consequently, the authors would like to take this opportunity to
thank the Guinness Brewery for its influence on statistics today, and for such a fine beer.

3.3 Abraham Wald 

Following Neyman and Pearson’s breakthrough, a different approach to optimality was 
introduced in Wald [156] and then developed in a sequence of papers culminating in 
Wald’s [161] book Statistical Decision Functions. Evidence of the influence of Neyman 
on Wald can be found in the citation of Neyman [101] in the introduction of Wald [156]. 
Brown [21] quotes the students of Neyman in 1966 from [102]: 

“The concepts of confidence intervals and of the Neyman-Pearson theory 
have proved immensely fruitful. A natural but far reaching extension of 
their scope can be found in Abraham Wald’s theory of statistical decision 
functions. The elaboration and application of the statistical tools related to 
these ideas has already occupied a generation of statisticians. It continues to 
be the main lifestream of theoretical statistics.” 

Brown’s purpose was to address whether the last sentence in the quote was still true in 2000.

Wolfowitz [170] describes the primary accomplishments of Wald’s Statistical Decision 
Theory as follows: 

“Wald’s greatest achievement was the theory of statistical decision functions, 
which includes almost all problems which are the raison d’etre of statistics” 

Leonard [88, Chp. 12] portrays Von Neumann’s return to Game Theory as “partly an
early reaction to upheaval and war”. However he adds that eventually Von Neumann
became personally involved in the war effort and “with that involvement came a significant,
unforeseeable moment in the history of game theory: this new mathematics made
its wartime entrance into the world, not as the abstract theory of social order central to
the book, but as a problem solving technique.” Moreover, on pages 278-280 Leonard
discusses the Statistical Research Groups at Berkeley, Columbia, and Princeton, in particular
Wald at Columbia, and how the effort to develop inspection and testing procedures
leads Wald to the development of sequential methods “apparently yielding significant
economies in inspection in the Navy”, leading to the publication of Wald and Wolfowitz’
[162] proof of the optimality of the sequential probability ratio test and Wald’s book
[159] Sequential Analysis. Leonard’s claim essentially is that the war stimulated these
fine theoretical minds to pursue activities with real application value. In this regard, it is
relevant to note Mangel and Samaniego’s [92] stimulating description of Wald’s work on
aircraft survivability, along with the contemporary, albeit somewhat vague, description
of “How a Story from World War II shapes Facebook today” by Wilson [167]. Indeed, in
the problem of how to allocate armoring to the allied bombers based on their condition
upon return from their missions, it was discovered that armoring where the previous
planes had been hit was not improving their rate of return. Wald’s ingenious insight was
that these were the returning bombers, not the ones which had been shot down. So the
places where the returning bombers were hit are more likely to be the places where one
does not need to add armoring. Evidently, his rigorous and unconventional innovations
to transform this intuition into a real methodology saved many lives. Wolfowitz [170]
states

“Wald not only posed his statistical problems clearly and precisely, but he 
posed them to fit the practical problem and to accord with the decisions the 
statistician was called on to make. This, in my opinion, was the key to his 
success-a high level of mathematical talent of the most abstract sort, and 
a true feeling for, and insight into, practical problems. The combination of 
the two in his person at such high levels was what gave him his outstanding 
character.” 

The section on Von Neumann and Remak (along with the Notes that follows it) 
in Kurz and Salvadori [77] describes Wald and Von Neumann’s relations. Brown [20] 
credits Wald as the creator of the minmax idea in statistics, evidently given axiomatic 
justification by Gilboa and Schmeidler [46]. This certainly had something to do with 
his friendship with Morgenstern and his relationship with Von Neumann, who together 
authored the famous book [155], but this influence can be explicitly seen in Wald’s 
[157] citation of Von Neumann [153] and Von Neumann and Morgenstern [155] in his 
introduction [157] of the minmax idea in Statistical Decision Theory. Indeed, Wolfowitz 
states that 

“... he was also spurred on by the connection between the newly announced 
results of [Von Neumann and Morgenstern] [155] and his own theory, and by 
the general interest among economists and others aroused by the theory of 
games.” 

Wolfowitz asserts that Wald’s work [156] Contributions to the Theory of Statistical
Estimation and Testing Hypotheses is “probably his most important paper” but that it
“went almost completely unnoticed”, possibly because “The use of Bayes solutions was
deterrent” and “Wald did not really emphasize that he was using Bayes solutions only as
a tool.” Moreover, although Wolfowitz considered Wald’s Statistical Decision Functions
[161] his greatest achievement, he also says

“The statistician who wants to apply the results of [161] to specific prob¬ 
lems is likely to be disappointed. Except for special problems, the complete 
classes are difficult to characterize in a simple manner and have not yet been 
characterized. Satisfactory general methods are not yet known for obtaining 
minimax solutions. If one is not always going to use a minimax solution (to 
which serious objections have been raised) or a solution satisfying some given
criterion, then the statistician should have the opportunity to choose from
among “representative” decision functions on the basis of their risk functions.
These are not available except for the simplest cases. It is clear that much remains
to be done before the use of decision functions becomes common. The
theory provides a rational basis for attacking almost any statistical problem, 
and, when some computational help is available and one makes some reason¬ 
able compromises in the interest of computational feasibility, one can obtain 
a practical answer to many problems which the classical theory is unable to 
answer or answers in an unsatisfactory manner.” 

Wolfowitz [170], Morgenstern [96] and Hotelling [54] provide a description of Wald’s
impact at the time of his passing. The influence of Wald’s minimax paradigm can also be
observed on (1) decision making under severe uncertainty [134, 135, 136], (2) Stochastic
Programming [130] (minimax analysis of stochastic problems), (3) minimax solutions
of stochastic linear programming problems [175], (4) Robust Convex Optimization [8]
(where one must find the best decision in view of the worst case parameter values within
deterministic uncertainty sets), (5) Econometrics [143] and (6) Savage’s minimax Regret
model [128].

3.4 Generalization to unknown pairs of functions and measures and to 
arbitrary sample data. 

In practice, complex systems of interest may involve both an imperfectly known response
function $f^\dagger$ and an imperfectly known probability measure $\mu^\dagger$, as illustrated in
the following problem.

Problem 4. Let $\mathcal{A}$ be a set of real functions and measures of probability on $[0,1]$ such
that $(f,\mu) \in \mathcal{A}$ if and only if $\mathbb{E}_\mu[X] \le m$ and $\sup_{x \in [0,1]} |g(x) - f(x)| \le 0.1$ where $g$
is some given real function on $[0,1]$. Let $(f^\dagger,\mu^\dagger)$ be an unknown element of $\mathcal{A}$ and let
$a \in \mathbb{R}$. Let $(X_1,\ldots,X_n)$ be $n$ i.i.d. samples from $\mu^\dagger$; you observe $(d_1,\ldots,d_n)$ with
$d_i = (X_i, f^\dagger(X_i))$. What is the “sharpest” estimate of $\mu^\dagger[f^\dagger(X) \ge a]$?

Problem 4 is an illustration of a situation in which the response function $f^\dagger$ and the
probability measure $\mu^\dagger$ are not directly observed and the sample data arrives in the form
of realizations of random variables, the distribution of which is related to $(f^\dagger,\mu^\dagger)$. To
simplify the current presentation assume that this relation is, in general, determined by
a function of $(f^\dagger,\mu^\dagger)$ and use the following notation: $\mathcal{D}$ will denote the observable space
(i.e. the space in which the sample data $d$ take values, assumed to be a metrizable Suslin
space) and $d$ will denote the $\mathcal{D}$-valued random variable corresponding to the observed
sample data. To represent the dependence of the distribution of $d$ on the unknown state
$(f^\dagger,\mu^\dagger) \in \mathcal{A}$ introduce a measurable function

$$\mathbb{D} \colon \mathcal{A} \to \mathcal{M}(\mathcal{D}),$$ (8)

where $\mathcal{M}(\mathcal{D})$ is given the Borel structure corresponding to the weak topology, to define
this relation. The idea is that $\mathbb{D}(f,\mu)$ is the probability distribution of the observed
sample data $d$ if $(f^\dagger,\mu^\dagger) = (f,\mu)$, and for this reason it may be called the data map (or,
even more loosely, the observation operator). Now consider the following problem.

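Problem 5. Let $\mathcal{A}$ be a known subset of $\mathcal{F}(\mathcal{X}) \times \mathcal{M}(\mathcal{X})$ as in Problem 2 and let $\mathbb{D}$
be a known data map as in (8). Let $\Phi$ be a known measurable semi-bounded function
mapping $\mathcal{A}$ onto $\mathbb{R}$. Let $(f^\dagger,\mu^\dagger)$ be an unknown element of $\mathcal{A}$. You observe $d \in \mathcal{D}$
sampled from the distribution $\mathbb{D}(f^\dagger,\mu^\dagger)$. What is the sharpest estimation of $\Phi(f^\dagger,\mu^\dagger)$?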


3.5 Model error and optimal models 


As in Wald’s Statistical Decision Theory [157], a natural notion of optimality can be
obtained by formulating Problem 5 as an adversarial game in which player A chooses
$(f^\dagger,\mu^\dagger) \in \mathcal{A}$ and player B (knowing $\mathcal{A}$ and $\mathbb{D}$) chooses a function $\theta$ of the observed data
$d$. As in (5) this notion of optimality requires the introduction of a risk function

$$\mathcal{E}\big(\theta,(f,\mu)\big) := \mathbb{E}_{d \sim \mathbb{D}(f,\mu)}\Big[V\big(\theta(d) - \Phi(f,\mu)\big)\Big]$$ (9)

where $V \colon \mathbb{R} \to \mathbb{R}$ is a real positive measurable loss function. As in (6) the least
upper bound on that statistical error $\mathcal{E}(\theta,(f,\mu))$ is obtained through a worst case
scenario with respect to all possible candidates for $(f,\mu)$ (player A’s choice), i.e.

$$\sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}\big(\theta,(f,\mu)\big)$$ (10)

and an optimal estimator/model (possibly within a given class) is then naturally obtained
as the minimizer of (10) over all functions $\theta$ of the data $d$ in that class (player B’s choice),
i.e. as the minimizer of

$$\inf_{\theta} \sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}\big(\theta,(f,\mu)\big).$$ (11)

Since in real applications true optimality will never be achieved, it is natural to generalize
to considering near-minimizers of (11) as near-optimal models/estimators.
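
As a hedged illustration of (9), (10) and (11), the following Python sketch works through a toy
instance that is not taken from this paper: the unknown is a Bernoulli parameter p in [0,1]
(there is no response function), the data d is the number of successes in n trials, the
quantity of interest is p itself, and V(x) = x^2. The one-parameter estimator class
theta_c(d) = (d + c)/(n + 2c), the grids, and all function names are assumptions made purely
for illustration; the supremum and infimum are only approximated on finite grids.

```python
import math

def risk(theta, n, p):
    """Risk (9) with V(x) = x**2: E_{d ~ Binomial(n, p)}[(theta(d) - p)**2]."""
    return sum(
        math.comb(n, d) * p**d * (1 - p)**(n - d) * (theta(d) - p)**2
        for d in range(n + 1)
    )

def worst_case_risk(theta, n, p_grid):
    """Grid approximation of (10): player A picks the least favourable p."""
    return max(risk(theta, n, p) for p in p_grid)

def shrinkage(n, c):
    """Hypothetical estimator class theta_c(d) = (d + c) / (n + 2c)."""
    return lambda d: (d + c) / (n + 2 * c)

n = 16
p_grid = [i / 200 for i in range(201)]
c_grid = [i / 20 for i in range(101)]          # c ranges over [0, 5]

# Grid approximation of (11) restricted to the class {theta_c}: player B's choice.
best_c = min(c_grid, key=lambda c: worst_case_risk(shrinkage(n, c), n, p_grid))
minimax_value = worst_case_risk(shrinkage(n, best_c), n, p_grid)
mle_value = worst_case_risk(shrinkage(n, 0.0), n, p_grid)   # c = 0 is d / n
```

Under these assumptions the search selects c = sqrt(n)/2, for which the risk does not depend
on p, and the resulting worst-case error is smaller than that of the empirical frequency d/n.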

Remark 1. In situations where the data map is imperfectly known (e.g. when the
data $d$ is corrupted by some noise of imperfectly known distribution) one has to include
a supremum over all possible candidates $\mathbb{D}$ (in the set of admissible data maps) in the
calculation of the least upper bound on the statistical error.

3.6 Mean squared error, variance and bias

For $(f,\mu) \in \mathcal{A}$ write $\operatorname{Var}_{d \sim \mathbb{D}(f,\mu)}[\theta(d)]$ the variance of the random variable $\theta(d)$ when $d$
is distributed according to $\mathbb{D}(f,\mu)$, i.e.

$$\operatorname{Var}_{d \sim \mathbb{D}(f,\mu)}[\theta(d)] := \mathbb{E}_{d \sim \mathbb{D}(f,\mu)}\big[(\theta(d))^2\big] - \Big(\mathbb{E}_{d \sim \mathbb{D}(f,\mu)}[\theta(d)]\Big)^2.$$

The following equation, whose proof is straightforward, shows that for $V(x) = x^2$, the
least upper bound on the mean squared error of $\theta$ is equal to the least upper bound on
the sum of the variance of $\theta$ and the square of its bias.

$$\sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}\big(\theta,(f,\mu)\big) = \sup_{(f,\mu) \in \mathcal{A}} \Big[\operatorname{Var}_{d \sim \mathbb{D}(f,\mu)}[\theta(d)] + \big(\mathbb{E}_{d \sim \mathbb{D}(f,\mu)}[\theta(d)] - \Phi(f,\mu)\big)^2\Big]$$

Therefore, for $V(x) = x^2$, the bias/variance tradeoff is made explicit.
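
The identity above can be checked numerically on a toy model. The sketch below is an
illustration only and is not part of the paper: it assumes binomial data d ~ Binomial(n, p),
the quantity of interest p, and an arbitrarily chosen biased estimator (d + 1)/(n + 2); the
function names are hypothetical and the supremum is approximated on a finite grid of p.

```python
import math

def binom_pmf(n, d, p):
    return math.comb(n, d) * p**d * (1 - p)**(n - d)

def mse_var_bias(theta, n, p):
    """Return (MSE, variance, bias) of theta(d) for d ~ Binomial(n, p)."""
    pmf = [binom_pmf(n, d, p) for d in range(n + 1)]
    mean = sum(w * theta(d) for d, w in enumerate(pmf))
    second_moment = sum(w * theta(d)**2 for d, w in enumerate(pmf))
    mse = sum(w * (theta(d) - p)**2 for d, w in enumerate(pmf))
    return mse, second_moment - mean**2, mean - p

n = 10
theta = lambda d: (d + 1) / (n + 2)            # hypothetical biased estimator
p_grid = [i / 100 for i in range(101)]

rows = [mse_var_bias(theta, n, p) for p in p_grid]
sup_mse = max(mse for mse, _, _ in rows)
sup_var_plus_bias2 = max(var + bias**2 for _, var, bias in rows)
```

The two suprema agree up to floating point error, and the decomposition also holds pointwise
in p for each row.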


3.7 Optimal interval of confidence 

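Although $\mathcal{E}$ can a priori be defined to be any risk function, taking $V(x) = \mathbb{1}_{[\gamma,\infty]}(|x|)$
(for some $\gamma > 0$) in (5) allows for a transparent and objective identification of optimal
intervals of confidence. Indeed, writing

$$\mathcal{E}_\gamma\big(\theta,(f,\mu)\big) := \mathbb{P}_{d \sim \mathbb{D}(f,\mu)}\Big[\big|\theta(d) - \Phi(f,\mu)\big| \ge \gamma\Big]$$

note that $\sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}_\gamma(\theta,(f,\mu))$ is the least upper bound on the probability (with respect
to the distribution of $d$) that the difference between the true value of the quantity of
interest $\Phi(f^\dagger,\mu^\dagger)$ and its estimated value $\theta(d)$ is larger than $\gamma$. Let $\epsilon \in [0,1]$. Define

$$\gamma_\epsilon := \inf\Big\{\gamma > 0 \;\Big|\; \inf_{\theta} \sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}_\gamma\big(\theta,(f,\mu)\big) \le \epsilon\Big\},$$

and observe that if $\theta_\epsilon$ is a minimizer of $\inf_\theta \sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}_{\gamma_\epsilon}(\theta,(f,\mu))$ then
$[\theta_\epsilon(d) - \gamma_\epsilon, \theta_\epsilon(d) + \gamma_\epsilon]$ is the smallest interval of confidence (random interval obtained as a
function of the data) containing $\Phi(f^\dagger,\mu^\dagger)$ with probability at least $1 - \epsilon$. Observe also that this
formulation is a natural extension of the OUQ formulation as described in [113]. Indeed,
in the absence of sample data, it is easy to show that $\theta_\epsilon$ is the midpoint of the optimal
interval $[\mathcal{L}(\mathcal{A}), \mathcal{U}(\mathcal{A})]$.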

Remark 2. We refer to [36, 37, 137] and in particular to Stein’s notorious paradox [138]
for the importance of a careful choice of the loss function.


3.8 Ordering the space of experiments 

A natural objective of UQ and Statistics is the design of experiments, their comparison,
and the identification of optimal ones. The comparison of experiments was introduced in
Blackwell [16] and Kiefer [72], with a more modern perspective in Le Cam [82] and Strasser
[139]; here observe that (11), as a function of $\mathbb{D}$, induces an order (transitive, total, but
not antisymmetric) on the space of data maps that has a natural experimental design
interpretation. More precisely, if the data maps $\mathbb{D}_1$ and $\mathbb{D}_2$ are interpreted as the
distributions of the outcomes of two possible experiments and if the value of (11) is smaller
for $\mathbb{D}_2$ than for $\mathbb{D}_1$, then $\mathbb{D}_2$ is a preferable experiment.

3.9 Mixing models 

Given estimators $\theta_1,\ldots,\theta_n$ can one obtain a better estimator by mixing those estimators?
If $V$ is convex (or quasi-convex) then the problem of finding an $\alpha \in [0,1]^n$ minimizing
the statistical error of $\sum_{i=1}^n \alpha_i \theta_i$ under the constraint $\sum_{i=1}^n \alpha_i = 1$ is a finite-dimensional
convex optimization problem in $\alpha$. If estimators are seen as models of reality then this
observation supports the idea that one can obtain improved models by mixing them
(with the goal of achieving minimal statistical errors).
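
A minimal sketch of such a mixing problem follows, under assumptions that are not taken from
this paper: binomial data d ~ Binomial(n, p), quantity of interest p, V(x) = x^2, and two
hand-picked estimators (the empirical frequency d/n and the constant 1/2). With two
estimators the simplex constraint leaves a single free weight a, and since the worst-case
error is convex in a, a ternary search is one simple way to minimize it. All names are
hypothetical and the supremum over p is approximated on a grid.

```python
import math

def risk(theta, n, p):
    """Mean squared error of theta(d) for d ~ Binomial(n, p)."""
    return sum(
        math.comb(n, d) * p**d * (1 - p)**(n - d) * (theta(d) - p)**2
        for d in range(n + 1)
    )

def worst_case_risk_of_mixture(alpha, estimators, n, p_grid):
    """Sup over p_grid of the risk of sum_i alpha[i] * estimators[i]."""
    mix = lambda d: sum(a * est(d) for a, est in zip(alpha, estimators))
    return max(risk(mix, n, p) for p in p_grid)

n = 16
estimators = [lambda d: d / n,      # empirical frequency
              lambda d: 0.5]        # data-independent guess
p_grid = [i / 100 for i in range(101)]

def objective(a):
    # alpha = (a, 1 - a) satisfies the simplex constraint by construction.
    return worst_case_risk_of_mixture((a, 1 - a), estimators, n, p_grid)

# Ternary search on [0, 1]; valid here because the objective is convex in a.
lo, hi = 0.0, 1.0
for _ in range(60):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if objective(m1) <= objective(m2):
        hi = m2
    else:
        lo = m1
best_a = (lo + hi) / 2
best_value = objective(best_a)
single_values = [objective(1.0), objective(0.0)]   # each estimator on its own
```

In this toy setting the optimal mixture has a strictly smaller worst-case error than either
estimator used alone.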


4 The Complete Class Theorem and Bayesian inference 

4.1 The Bayesian approach 

The Bayesian answer to Problem 5 is to assume that $(f^\dagger,\mu^\dagger)$ is a sample from some
(prior) measure $\pi$ on $\mathcal{A}$ and then condition the expectation of $\Phi(f,\mu)$ with respect to
the observation of the data, i.e. use

$$\mathbb{E}_{(f,\mu) \sim \pi,\, d \sim \mathbb{D}(f,\mu)}\big[\Phi(f,\mu) \,\big|\, d\big]$$ (12)

as the estimator $\theta(d)$. This requires giving $\mathcal{A}$ the structure of a measurable space such
that important quantities of interest such as $(f,\mu) \to \mu[f(X) \ge a]$ and $(f,\mu) \to \mathbb{E}_\mu[f]$ are
measurable. This can be achieved using results of Ressel [120] providing conditions under
which the map from function/measure pairs $(f,\mu)$ to the induced law is Borel
measurable. We will henceforth assume $\mathcal{A}$ to be a Suslin space, and proceed to construct
the measure of probability $\pi \odot \mathbb{D}$ of $((f,\mu),d)$ on $\mathcal{A} \times \mathcal{D}$ via a natural generalization of the
Campbell measure and Palm distribution associated with a random measure as described
in [67], see also [24, Ch. 13] for a more current treatment. We refer to Subsection 7.1 of
the appendix for the details of the construction of the distribution $\pi \odot \mathbb{D}$ of $((f,\mu),d)$
when $(f,\mu) \sim \pi$ and $d \sim \mathbb{D}(f,\mu)$, and of the marginal distribution $\pi \cdot \mathbb{D}$ of $\pi \odot \mathbb{D}$ on
the data space $\mathcal{D}$, and the resulting regular conditional expectation (12). Consequently,
the nested expectation $\mathbb{E}_{(f,\mu) \sim \pi,\, d \sim \mathbb{D}(f,\mu)}$ appearing in (12) will from now on be rigorously
written as the expectation $\mathbb{E}_{((f,\mu),d) \sim \pi \odot \mathbb{D}}$.


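Statistical error when $(f^\dagger,\mu^\dagger)$ is random. When $(f^\dagger,\mu^\dagger)$ is a random realization of
$\pi^\dagger$ one may consider averaging the statistical error (9) with respect to $\pi^\dagger$ and introduce

$$\mathcal{E}(\theta,\pi^\dagger) := \mathbb{E}_{((f,\mu),d) \sim \pi^\dagger \odot \mathbb{D}}\Big[V\big(\theta(d) - \Phi(f,\mu)\big)\Big]$$ (13)

When $\pi^\dagger$ is an unknown element of a subset $\Pi$ of $\mathcal{M}(\mathcal{A})$ the least upper bound on
the statistical error (13) given the available information is obtained by taking the sup of
(13) with respect to all possible candidates for $\pi^\dagger$, i.e.

$$\sup_{\pi \in \Pi} \mathcal{E}(\theta,\pi)$$ (14)

When $\mathcal{A}$ is Suslin and when $(f^\dagger,\mu^\dagger)$ is not a random sample from $\pi^\dagger$ but simply an
unknown element of $\mathcal{A}$, then a straightforward application of the reduction theorems of
[113] implies that when $\Pi = \mathcal{M}(\mathcal{A})$, (14) is equal to (10), i.e.

$$\sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}\big(\theta,(f,\mu)\big) = \sup_{\pi \in \mathcal{M}(\mathcal{A})} \mathcal{E}(\theta,\pi)$$ (15)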


4.2 Relation between adversarial model error and Bayesian error. 


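When $\Phi$ has a second moment with respect to $\pi$, one can utilize the classical variational
description of conditional expectation as follows: Denote by $L^2_{\pi \cdot \mathbb{D}}(\mathcal{D})$ the space of
($\pi \cdot \mathbb{D}$ a.e. equivalence classes of) real-valued measurable functions on $\mathcal{D}$ that are
square-integrable with respect to the measure $\pi \cdot \mathbb{D}$. Then one has (see Subsection 7.3)

$$\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi \mid d] := \operatorname*{arg\,min}_{h \in L^2_{\pi \cdot \mathbb{D}}(\mathcal{D})} \mathbb{E}_{((f,\mu),d) \sim \pi \odot \mathbb{D}}\Big[\big(\Phi(f,\mu) - h(d)\big)^2\Big].$$

In other words, if $(f,\mu)$ is sampled from the measure $\pi$, $\mathbb{E}_{\pi \odot \mathbb{D}}[\Phi(f,\mu) \mid d]$ is the best
mean-square approximation of $\Phi(f,\mu)$ in the space of square-integrable functions of $d$.
As with the regular conditional probabilities, the real-valued function on $\mathcal{D}$

$$\theta_\pi(d) := \mathbb{E}_{\pi \odot \mathbb{D}}\big[\Phi(f,\mu) \,\big|\, d\big], \quad d \in \mathcal{D}$$ (16)

is uniquely defined up to subsets of $\mathcal{D}$ of $(\pi \cdot \mathbb{D})$-measure zero.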

Using the orthogonal projection property of the conditional expectation one obtains
that if $V(x) = x^2$ then for arbitrary $\theta$,

$$\mathcal{E}(\theta,\pi) = \mathcal{E}(\theta_\pi,\pi) + \mathbb{E}_{d \sim \pi \cdot \mathbb{D}}\Big[\big(\theta(d) - \theta_\pi(d)\big)^2\Big]$$ (17)

Therefore, if $\Pi \subset \mathcal{M}(\mathcal{A})$ is an admissible set of priors, then (17) implies that

$$\inf_{\theta} \sup_{\pi \in \Pi} \mathcal{E}(\theta,\pi) \ge \sup_{\pi \in \Pi} \mathcal{E}(\theta_\pi,\pi).$$

In particular, when $\Pi = \mathcal{M}(\mathcal{A})$, (15) implies that

$$\inf_{\theta} \sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}\big(\theta,(f,\mu)\big) \ge \sup_{\pi \in \mathcal{M}(\mathcal{A})} \mathcal{E}(\theta_\pi,\pi).$$ (18)

Therefore, the mean squared error of the best estimator assuming $(f^\dagger,\mu^\dagger) \in \mathcal{A}$ to be
unknown is bounded below by the largest mean squared error of the Bayesian estimator
obtained by assuming that $(f^\dagger,\mu^\dagger)$ is distributed according to some $\pi \in \mathcal{M}(\mathcal{A})$. In the
next section it will be shown that a complete class theorem can be used to obtain that
(18) is actually an equality. In that case, (18) can be used to quantify the approximate
optimality of an estimator $\theta$ by comparing the least upper bound $\sup_{(f,\mu) \in \mathcal{A}} \mathcal{E}(\theta,(f,\mu))$
on the error of that estimator with $\mathcal{E}(\theta_\pi,\pi)$ for a carefully chosen $\pi$.
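
The decomposition (17) and the lower bound (18) can be illustrated numerically. The sketch
below is only an illustration under assumptions that are not taken from this paper: binomial
data d ~ Binomial(n, p), quantity of interest p, V(x) = x^2, and conjugate Beta(a, b) priors,
for which the posterior mean (16) is (d + a)/(n + a + b). Integrals over the prior are
approximated with a midpoint rule and all function names are hypothetical.

```python
import math

def binom_pmf(n, d, p):
    return math.comb(n, d) * p**d * (1 - p)**(n - d)

def beta_pdf(p, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * p**(a - 1) * (1 - p)**(b - 1)

def risk(theta, n, p):
    return sum(binom_pmf(n, d, p) * (theta(d) - p)**2 for d in range(n + 1))

def integrate(g, m=4000):
    """Midpoint rule on (0, 1)."""
    return sum(g((i + 0.5) / m) for i in range(m)) / m

def bayes_risk(theta, n, a, b):
    """Averaged error (13) of theta under the prior Beta(a, b)."""
    return integrate(lambda p: risk(theta, n, p) * beta_pdf(p, a, b))

def posterior_mean(n, a, b):
    """Bayesian estimator (16) for the conjugate Beta(a, b) prior."""
    return lambda d: (d + a) / (n + a + b)

n = 10
mle = lambda d: d / n

# Identity (17) for theta = d / n under the uniform prior Beta(1, 1).
theta_pi = posterior_mean(n, 1.0, 1.0)
marginal = [integrate(lambda p, d=d: binom_pmf(n, d, p) * beta_pdf(p, 1.0, 1.0))
            for d in range(n + 1)]
gap = sum(w * (mle(d) - theta_pi(d))**2 for d, w in enumerate(marginal))
lhs_17 = bayes_risk(mle, n, 1.0, 1.0)
rhs_17 = bayes_risk(theta_pi, n, 1.0, 1.0) + gap

# Lower bound (18): a Bayes risk never exceeds the worst-case risk of an estimator.
p_grid = [i / 200 for i in range(201)]
c = math.sqrt(n) / 2
theta_c = posterior_mean(n, c, c)
worst_case_theta_c = max(risk(theta_c, n, p) for p in p_grid)
bayes_uniform = bayes_risk(theta_pi, n, 1.0, 1.0)
bayes_c = bayes_risk(theta_c, n, c, c)
```

In this toy model the uniform prior gives a strict lower bound, while the prior
Beta(sqrt(n)/2, sqrt(n)/2) attains the worst-case error of its own posterior mean, so the
bound is tight for that choice of prior.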

4.3 Complete Class theorem

A fundamental question is whether (18) is an equality: is the adversarial error of the best
estimator equal to the non-adversarial error of the worst Bayesian estimator? Is the best
estimator Bayesian or an approximation thereof? A remarkable result of decision theory
[156, 157, 158, 160, 161] is the complete class theorem which states (in the formulation of
this paper) that if (1) the admissible measures $\mu$ are absolutely continuous with respect
to the Lebesgue measure, (2) the loss function $V$ in the definition of $\mathcal{E}(\theta,(f,\mu))$ is convex
in $\theta$ and (3) the decision space is compact, then optimal estimators live in the Bayesian
class and non-Bayesian estimators cannot be optimal. The idea of the proof of this result
is to use the compactness of the decision space and the continuity of the loss function to
approximate the decision theory game by a finite game and recall that optimal strategies
of adversarial finite zero-sum games are mixed strategies [99, 98].
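
The role of mixed strategies can be seen on the smallest possible example. The sketch below
is a toy 2 x 2 zero-sum game chosen for illustration (it does not come from this paper): the
estimator pays a loss of 1 when its choice differs from nature's and 0 otherwise. The names
and the grid resolution are assumptions; mixed strategies are scanned on a grid rather than
solved exactly.

```python
loss = [[0.0, 1.0],
        [1.0, 0.0]]   # loss[i][j]: the estimator plays i, nature plays j

def expected_loss(q, r):
    """Expected loss when the estimator plays 0 w.p. q and nature plays 0 w.p. r."""
    probs_i, probs_j = (q, 1 - q), (r, 1 - r)
    return sum(probs_i[i] * probs_j[j] * loss[i][j]
               for i in range(2) for j in range(2))

pure = (0.0, 1.0)                         # deterministic choices only
grid = [k / 100 for k in range(101)]      # mixed strategies on a grid

# Pure strategies: inf sup and sup inf differ, so the game has no value.
pure_upper = min(max(expected_loss(q, r) for r in pure) for q in pure)
pure_lower = max(min(expected_loss(q, r) for q in pure) for r in pure)

# Mixed strategies: the two quantities coincide.
mixed_upper = min(max(expected_loss(q, r) for r in grid) for q in grid)
mixed_lower = max(min(expected_loss(q, r) for q in grid) for r in grid)
```

With deterministic choices the gap between the two values is 1; once both players may
randomize, both values equal 1/2, attained at q = r = 1/2.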

Le Cam [80], see also [82], has substantially extended Wald’s theory in the sense
that requirements of boundedness, or even finiteness, of the loss function are replaced
by a requirement of lower semicontinuity, and the requirements of the compactness of
the decision space and the absolute continuity of the admissible measures with respect
to the Lebesgue measure are removed. These vast generalizations come at some price
of abstraction, yet reveal the relevance and utility of an appropriate complete Banach
lattice of measures. In particular, this framework of Le Cam appears to facilitate efficient
concrete approximation.

As an illustration, let us describe a complete class theorem on a space of admissible
measures, without the inclusion of functions, where the observation map consists of
taking $n$ i.i.d. samples, as in Equation (5). Let $\mathcal{A} \subset \mathcal{M}(\mathcal{X})$ be a subset of the Borel
probability measures on a topological space $\mathcal{X}$ and consider a quantity of interest
$\Phi \colon \mathcal{A} \to \mathbb{R}$. For $\mu \in \mathcal{A}$, the data $d$ is generated by i.i.d. sampling with respect to $\mu^n$. That
is $d \sim \mu^n$. For $\mu^\dagger \in \mathcal{A}$, the statistical error $\mathcal{E}(\theta,\mu^\dagger)$ of an estimator $\theta \colon \mathcal{X}^n \to \mathbb{R}$ of $\Phi(\mu^\dagger)$
is defined in terms of a loss function $V \colon \mathbb{R} \to \mathbb{R}$ as in (5). Define the least upper bound
on that statistical error and the sharpest estimator as in (6) and (7).

Let $\Theta := \{\theta \colon \mathcal{X}^n \to \mathbb{R},\ \theta \text{ measurable}\}$ denote the space of estimators. Since, in
general the game $\mathcal{E}(\theta,\mu)$, $\theta \in \Theta$, $\mu \in \mathcal{A}$ will not have a value, that is one will have a
strict inequality

$$\sup_{\mu \in \mathcal{A}} \inf_{\theta \in \Theta} \mathcal{E}(\theta,\mu) < \inf_{\theta \in \Theta} \sup_{\mu \in \mathcal{A}} \mathcal{E}(\theta,\mu),$$

classical arguments in game theory suggest that one extend to random estimators and
random selection in $\mathcal{A}$. To that end, let the set of randomized estimators
$\mathcal{R} := \{\hat\theta \colon \mathcal{X}^n \times \mathcal{B}(\mathbb{R}) \to [0,1],\ \hat\theta \text{ Markov}\}$ be the set of Markov kernels. To define a topology for $\mathcal{R}$,
define a linear space of measures as follows. Let $\mathcal{A}^n := \{\mu^n \in \mathcal{M}(\mathcal{X}^n) : \mu \in \mathcal{A}\}$ denote
the corresponding set of measures generating sample data. Say that $\mathcal{A}^n$ is dominated if
there exists an $\omega \in \mathcal{M}(\mathcal{X}^n)$ such that every $\mu^n \in \mathcal{A}^n$ is absolutely continuous with respect
to $\omega$. According to the Halmos-Savage Lemma [49], see also Strasser [139, Lem. 20.3], the
set $\mathcal{A}^n$ is dominated if and only if there exists a countable mixture $\mu^* := \sum_{i=1}^{\infty} \alpha_i \mu_i^n$ (with
$\alpha_i \ge 0$, $\mu_i \in \mathcal{A}$, $i = 1,\ldots,\infty$, and $\sum_{i=1}^{\infty} \alpha_i = 1$) such that $\mu^n \ll \mu^*$ for all $\mu \in \mathcal{A}$. A construct at
the heart of Le Cam’s approach is a natural linear space notion of a mixture space of $\mathcal{A}$,
called the L-space $L(\mathcal{A}^n)$. It follows easily, see [139, Lem. 41.1], that $L(\mathcal{A}^n)$ is
the set of signed measures which are absolutely continuous with respect to $\mu^*$. When $\mathcal{A}$
is not dominated a natural generalization of this construction [139, Def. 41.3] due to Le
Cam [80] is used. A crucial property of the L-space $L(\mathcal{A}^n)$ is that not only is it a Banach
lattice, see Strasser [139, Cor. 41.4], but by [139, Lem. 41.5] it is a complete lattice. The
utility of the concept of a complete lattice to the complete class theorems can clearly be
seen in the proof of the lemma in Section 2 of Wald and Wolfowitz’ [163] proof of the
complete class theorem when the number of decisions and the number of distributions is
finite. Then, the natural action of a randomized estimator $\hat\theta$ on the bounded continuous
function/mixture pairs $C_b(\mathbb{R}) \times L(\mathcal{A}^n)$ is

$$f \hat\theta \mu := \int\!\!\int f(r)\, \hat\theta(x^n, dr)\, \mu(dx^n), \quad f \in C_b(\mathbb{R}),\ \mu \in L(\mathcal{A}^n).$$

Let $\mathcal{R}$ be endowed with the topology of pointwise convergence with respect to this
action, that is the weak topology with respect to integration against $C_b(\mathbb{R}) \times L(\mathcal{A}^n)$.
Moreover, this weak topology also facilitates a definition of the space $\bar{\mathcal{R}}$ of generalized
random estimators as bilinear real-valued maps $\vartheta \colon C_b(\mathbb{R}) \times L(\mathcal{A}^n) \to \mathbb{R}$ satisfying
$|\vartheta(f,\mu)| \le \|f\|_\infty \|\mu\|$, $\vartheta(f,\mu) \ge 0$ for $f \ge 0$, $\mu \ge 0$, and $\vartheta(1,\mu) = \mu(\mathcal{X}^n)$. By [139,
Thm. 42.3] the set of generalized random estimators $\bar{\mathcal{R}}$ is compact and convex and by
[139, Thm. 42.5] of Le Cam [81], $\mathcal{R}$ is dense in $\bar{\mathcal{R}}$ in the weak topology. Moreover, when
$\mathcal{A}^n$ is dominated and one can restrict to a compact subset $C \subset \mathbb{R}$ of the decision space,
then Strasser [139, Cor. 42.8] asserts that $\bar{\mathcal{R}} = \mathcal{R}$.

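Returning to our illustration, if one lets $W_\mu$, $\mu \in \mathcal{A}$, defined by $W_\mu(r) := V(r - \Phi(\mu))$,
$r \in \mathbb{R}$, $\mu \in \mathcal{A}$, denote the associated family of loss functions, one can now define a
generalization of the statistical error function $\mathcal{E}(\theta,\mu)$ of (5) to randomized estimators $\hat\theta$
by

$$\mathcal{E}(\hat\theta,\mu) := \int\!\!\int W_\mu(r)\, \hat\theta(x^n, dr)\, \mu^n(dx^n), \quad \hat\theta \in \mathcal{R},\ \mu \in \mathcal{A}.$$

This definition reduces to the previous one (5) when the random estimator $\hat\theta$ corresponds
to a point estimator $\theta$ and extends naturally to $\bar{\mathcal{R}}$. Finally, one says that an estimator
$\vartheta^* \in \bar{\mathcal{R}}$ is Bayesian if there exists a probability measure $m$ with finite support on $\mathcal{A}$ such
that

$$\int \mathcal{E}(\vartheta^*,\mu)\, m(d\mu) \le \int \mathcal{E}(\vartheta,\mu)\, m(d\mu), \quad \vartheta \in \bar{\mathcal{R}}.$$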

The following complete class theorem follows from Strasser [139, Thm. 47.9, Cor. 42.8]
since one can naturally compactify the decision space $\mathbb{R}$ when the quantity of interest
$\Phi$ is bounded and the loss function $V$ is sublevel compact, that is has compact sublevel
sets.

Theorem 3. Suppose that the loss function $V$ is sublevel compact and the quantity of
interest $\Phi \colon \mathcal{A} \to \mathbb{R}$ is bounded. Then, for each generalized randomized estimator $\vartheta \in \bar{\mathcal{R}}$,
there exists a weak limit $\vartheta^* \in \bar{\mathcal{R}}$ of Bayesian estimators such that

$$\mathcal{E}(\vartheta^*,\mu) \le \mathcal{E}(\vartheta,\mu), \quad \mu \in \mathcal{A}.$$

If, in addition, $\mathcal{A}$ is dominated, then there exists such a $\vartheta^* \in \mathcal{R}$.

A comprehensive connection of these results, where Bayesian estimators are defined 
only in terms of measures of finite support on A, with the framework of Section 4 where 
Bayesian estimators are defined in terms of Borel measures on A is not available yet. 
Nevertheless it appears that much can be done in this regard. In particular, one can sus¬ 
pect that when ^ is a closed convex set of probability measures equipped with the weak 
topology and A is a Borel subset of a Polish space, that if the loss function V is convex 
and $\Phi$ is affine and measurable, that the Choquet theory of Winkler [168, 169] can be
used to facilitate this connection. Indeed, as mentioned above, complete class theorems 
are available for much more general loss functions than continuous or convex, more general
decision spaces than $\mathbb{R}$, and without absolute continuity assumptions. Moreover, it
is interesting to note that, although randomization was introduced to obtain minmax
results, when the loss function $V$ is strictly convex, Bayesian estimators can be shown to
be non-random. This can be explicitly observed in the definition (16) of Bayesian estimators
when $V(x) := x^2$ and is understood much more generally in Dvoretzky, Wald and
Wolfowitz [32]. We conjecture that further simplifications can be obtained by allowing 
approximate versions of complete class theorems, Bayesian estimators, optimality, and 
saddle points, as in Scovel, Hush and Steinwart’s [129] extension of classical Lagrangian 
duality theory to include approximations. 

5 Incorporating complexity and computation 

Although Decision Theory provides well posed notions of optimality and performance 
in statistical estimation, it does not address the complexity of the actual computation 
of optimal or nearly optimal estimators and their evaluation against the data. Indeed, 
although the abstract identification of an optimal estimator as the solution of an op¬ 
timization problem provides a clear objective, practical applications require the actual 
implementation of the estimator on a machine and its numerical evaluation against the 
data. 

5.1 Machine Wald 

The simultaneous emphasis on performance and computation can be traced back to PAC 
(probably approximately correct) Learning initiated by Valiant [149] which has laid 
down the foundations of Machine Learning (ML). Indeed, as asserted by Wasserman in 
his 2013 Lecture “The Rise of the Machines” [164, Sec. 1.5]: 

“There is another interesting difference that is worth pondering. Consider 
the problem of estimating a mixture of Gaussians. In Statistics we think of 
this as a solved problem. You use, for example, maximum likelihood which is 
implemented by the EM algorithm. But the EM algorithm does not solve the 
problem. There is no guarantee that the EM algorithm will actually find the 
MLE; it’s a shot in the dark. The same comment applies to MCMC methods.
In ML, when you say you’ve solved the problem, you mean that there is a
polynomial time algorithm with provable guarantees.”

That is, on an even par with the rigorous performance analysis, Machine Learning also requires
that solutions be efficiently implementable on a computer, and often such efficiency
is established by proving bounds on the amount of computation required to produce such 
a solution with a given algorithm. Although Wald’s theory of Optimal Statistical Deci¬ 
sions has resulted in many important statistical discoveries, looking through the three 
Lehmann symposia of Rojo and Perez-Abreu [126] in 2004, and Rojo [124, 125] in 2006 
and 2009, it is clear that the incorporation of the analysis of the computational algo¬ 
rithm, both in terms of its computational efficiency and its statistical optimality has 
not begun. Therefore a natural answer to fundamental challenges in UQ appears to be 
the full incorporation of computation into a natural generalization of Wald’s Statistical 
Decision Function framework, producing a framework one might call Machine Wald. 

5.2 Reduction Calculus 

The resolution of minimax problems (11) requires, at an abstract level, searching in the
space of all possible functions of the data. By restricting models to the Bayesian class,
the complete class theorem allows one to limit this search to prior distributions on $\mathcal{A}$, i.e.
to measures over spaces of measures and functions. To enable the computation of these
models it is therefore necessary to identify conditions under which Minimax problems 
over measures over spaces of measures and functions can be reduced to the manipulation 
of finite-dimensional objects and develop the associated reduction calculus. For min 
or max problems over measures over spaces of measures (and possibly functions) this 
calculus can take the form of a reduction to a nesting of optimization problems over 
measures (and possibly functions for the inner part) [111, 110, 112], which, in turn, can 
be reduced to searches over extreme points [113, 142, 50, 108]. 

5.3 Stopping conditions 

Many of these optimization problems will not be tractable. However even in the tractable
case, which has rigorous guarantees on the amount of computation required to obtain an
approximate optimum, it will be useful to have stopping criteria for the specific algorithm
and the specific problem instance under consideration, which can be used to guarantee
when an approximate optimum has been achieved. Although in the intractable case no
such guarantee will exist in general, intelligent choices of algorithms may result in the
attainment of approximate optima and such tests guarantee that fact. Ermoliev, Gaivoronski
and Nedeva [35] successfully develop such stopping criteria using Lagrangian
duality and the generalized Benders decomposition of Geoffrion [45] for certain Stochastic
Optimization problems which are also relevant here. In addition, the approximation
of intractable problems by tractable ones will be important. Recently, Hanasusanto,
Roitch, Kuhn and Wiesemann [52] derived explicit conic reformulations for tractable
problem classes and suggested efficiently computable conservative approximations for
intractable ones.


5.4 On the Borel-Kolmogorov paradox 

An oftentimes overlooked difficulty with Bayesian estimators lies in the fact that for a
prior $\pi \in \mathcal{M}(\mathcal{A})$, the posterior (12) is not a measurable function of $d$ but a convex set
$\Theta(\pi)$ of measurable functions $\theta$ of $d$ that are almost-surely equal to each other under
the measure $\pi \cdot \mathbb{D}$ on $\mathcal{D}$.

A notorious pathological consequence is the Borel-Kolmogorov paradox (see Chapter 
5 of [75] and Section 15.7 of [62]). Recall that in the formulation of this paradox one 
considers the uniform distribution on the two-dimensional sphere and one is interested 
in obtaining the conditional distribution associated with a great circle of that sphere. 
If the problem is parameterized in spherical coordinates, then the resulting conditional 
distribution is uniform for the equator but non uniform for the longitude corresponding 
to the prime meridian. The following theorem suggests that this paradox is generic
and dispels the idea that it could be limited to fabricated toy examples. See also
Singpurwalla and Swift [132] for implications of this paradox in modeling and inference. 
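
The dependence on the parameterization can be observed in a small Monte Carlo experiment.
The sketch below is an illustration written for this purpose and is not taken from [75] or
[62]: it approximates conditioning on a great circle by conditioning on a thin band around
it, once in latitude (around the equator) and once in longitude (around the prime meridian).
The band half-width eps, the sample size and the seed are arbitrary assumptions.

```python
import math
import random

def sample_sphere(rng):
    """Draw (latitude, longitude) in radians, uniformly on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)               # the height z is uniform on [-1, 1]
    lon = rng.uniform(-math.pi, math.pi)
    return math.asin(z), lon

rng = random.Random(0)                        # fixed seed, for reproducibility only
eps = 0.05                                    # half-width of the conditioning bands
lon_near_equator, lat_near_meridian = [], []
for _ in range(200_000):
    lat, lon = sample_sphere(rng)
    if abs(lat) < eps:                        # band of latitudes around the equator
        lon_near_equator.append(lon)
    if abs(lon) < eps:                        # band of longitudes around lon = 0
        lat_near_meridian.append(lat)

# Normalised mean distance from the centre of each arc; 0.5 corresponds to uniform.
equator_stat = sum(abs(x) for x in lon_near_equator) / len(lon_near_equator) / math.pi
meridian_stat = (sum(abs(x) for x in lat_near_meridian)
                 / len(lat_near_meridian) / (math.pi / 2))
```

Along the equator the statistic is close to 1/2, as for a uniform law, whereas along the
meridian it is close to 1 - 2/pi, the value obtained for a density proportional to the
cosine of the latitude.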

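Recall that for $\pi \in \mathcal{M}(\mathcal{A})$, $\Theta(\pi)$ is defined as the convex set of measurable
functions which are equal $\pi \cdot \mathbb{D}$-almost everywhere to the regular conditional expectation (12).
Despite this indeterminateness it is comforting to know that

$$\mathcal{E}(\theta_2,\pi) = \mathcal{E}(\theta_1,\pi), \quad \theta_1, \theta_2 \in \Theta(\pi).$$

Moreover, it is also easy to see that if $\pi^\dagger$ is absolutely continuous with respect to $\pi$, then
$\theta_1(d) = \theta_2(d)$ with $\pi^\dagger \cdot \mathbb{D}$ probability one for all $\theta_1, \theta_2 \in \Theta(\pi)$, and consequently

$$\mathcal{E}(\theta_2,\pi^\dagger) = \mathcal{E}(\theta_1,\pi^\dagger), \quad \theta_1, \theta_2 \in \Theta(\pi),\ \pi^\dagger \ll \pi,$$

where the notation $\pi^\dagger \ll \pi$ means that $\pi^\dagger$ is absolutely continuous with respect to $\pi$. The
following theorem shows that this requirement of absolute continuity is necessary for all
versions of conditional expectations $\theta \in \Theta(\pi)$ to share the same risk. See Subsection 7.2
for its proof.

Theorem 4. Assume that $V(x) = x^2$ and that the image $\Phi(\mathcal{A})$ is a nontrivial interval.
If $\pi^\dagger$ is not absolutely continuous with respect to $\pi$ then

$$\frac{1}{4} \le \frac{\sup_{\theta_1,\theta_2 \in \Theta(\pi)} \big(\mathcal{E}(\theta_2,\pi^\dagger) - \mathcal{E}(\theta_1,\pi^\dagger)\big)}{\big(\mathcal{U}(\mathcal{A}) - \mathcal{L}(\mathcal{A})\big)^2 \sup_{B \in \mathcal{B}(\mathcal{D}) :\, (\pi \cdot \mathbb{D})[B] = 0} (\pi^\dagger \cdot \mathbb{D})[B]} \le 1$$ (19)

where $\mathcal{U}(\mathcal{A})$ and $\mathcal{L}(\mathcal{A})$ are defined by (2) and (3).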

Remark 5. If moreover $\pi^\dagger \cdot \mathbb{D}$ is orthogonal to $\pi \cdot \mathbb{D}$, that is there exists a set
$B \in \mathcal{B}(\mathcal{D})$ such that $(\pi \cdot \mathbb{D})[B] = 0$ and $(\pi^\dagger \cdot \mathbb{D})[B] = 1$, then Theorem 4 implies that
$\sup_{\theta_1,\theta_2 \in \Theta(\pi)} \big(\mathcal{E}(\theta_2,\pi^\dagger) - \mathcal{E}(\theta_1,\pi^\dagger)\big)$ is larger than the statistical error of the midpoint
estimator

$$\theta := \frac{\mathcal{L}(\mathcal{A}) + \mathcal{U}(\mathcal{A})}{2}.$$

As a remedy one can try (see [144], [145] and [116]) constructing conditional expectations
as disintegration or derivation limits defined as

$$\mathbb{E}_{\pi \odot \mathbb{D}}\big[\Phi(f,\mu) \,\big|\, D = d\big] = \lim_{B \downarrow \{d\}} \mathbb{E}_{\pi \odot \mathbb{D}}\big[\Phi(f,\mu) \,\big|\, D \in B\big]$$ (20)

where the limit $B \downarrow \{d\}$ is taken over a net of open neighborhoods of $d$. But as shown
in [65], the limit generally depends on the net $B \downarrow \{d\}$ and the resulting conditional
expectations can be distinctly different for different nets. Furthermore the limit (20)
may exist/not exist on subsets of $\mathcal{D}$ of $(\pi \cdot \mathbb{D})$-measure zero (which, as shown above, can
lead to the inconsistency of the estimator). A related important issue is that conditional
probabilities cannot in general be computed [1]. Observe that if the limit (20) does not
exist then Bayesian estimation of $\Phi(f,\mu)$ may have significant oscillations as the precise
measurement of $d$ becomes sharper.

5.5 On Bayesian Robustness/Brittleness 

As much as classical numerical analysis shows that there are stable and unstable ways to 
discretize a partial differential equation, positive [12, 22, 27, 74, 79, 140, 152] and negative 
results [5, 26, 41, 42, 64, 83, 111, 110, 112, 107] are forming an emerging understanding 
of stable and unstable ways to apply Bayes’ rule in practice. One aspect of stability 
concerns the sensitivity of posterior conclusions with respect to the underlying models 
and prior beliefs. 

“Most statisticians would acknowledge that an analysis is not complete unless 
the sensitivity of the conclusions to the assumptions is investigated. Yet, in 
practice, such sensitivity analyses are rarely used. This is because sensitivity 
analyses involve difficult computations that must often be tailored to the 
specific problem. This is especially true in Bayesian inference where the 
computations are already quite difficult.” [165] 

Another aspect concerns situations where Bayes’ rule is applied iteratively and posterior
values become prior values for the next iteration. Observe in particular that when posterior
distributions (which are later used as prior distributions) are only approximated
(e.g. via MCMC methods), stability requires the convergence of the MCMC method in
the same metric used to quantify the sensitivity of posterior distributions with respect to
prior distributions.

In the context of the framework being developed here, recent results [111, 110,
112, 107] on the extreme sensitivity (brittleness) of Bayesian inference in the TV and
Prokhorov metrics appear to suggest that robust inference, in a continuous world under
finite information, should perhaps be done with reduced/coarse models rather than
highly sophisticated/complex models (with a level of coarseness/reduction depending on
the available finite information) [112].

5.6 Information Based Complexity

From the point of view of practical applications it is clear that the set of possible models
entering in the Minimax Problem (11) must be restricted by introducing constraints
on computational complexity. For example, finding optimal models of materials in extreme
environments is not the correct objective when these models require full Quantum
Mechanics calculations. A more productive approach is to search for computationally
tractable optimal models in a given complexity class. Here one may wonder whether Bayesian
models remain a complete class for the resulting complexity-constrained Minimax problems.
It is also clear that computationally tractable optimal models may not use all
the available information; for instance, a material model of bounded complexity should
not use the state of every atom. The idea that fast computation requires computation
with partial information forms the core of Information Based Complexity, the branch
of computational complexity that studies the complexity of approximating continuous
mathematical operations with discrete and finite ones up to a specified level of accuracy
[146, 171, 114, 100, 172], where it is also augmented by concepts of contaminated
and priced information associated with, for example, truncation errors and the cost of
numerical operations. Recent results [106] suggest that Decision Theory concepts could
be used not only to identify reduced models but also to identify algorithms of near-optimal
complexity, by reformulating the process of computing with partial information and limited
resources as that of playing underlying hierarchies of adversarial information games.

6 Conclusion 

Although Uncertainty Quantification is still in its formative stage, much like the state 
of probability theory before its rigorous formulation by Kolmogorov in the 1930s, it has 
the potential to have an impact on the process of scientific discovery that is similar to 
the advent of scientific computing. Its emergence remains sustained by the persistent 
need to make critical decisions with partial information and limited resources. There are 
many paths to its development, but one such path appears to be the incorporation of 
notions of computation and complexity in a generalization of Wald’s decision framework 
built on Von Neumann’s theory of adversarial games. 

7 Appendix 

7.1 Construction of $\pi \odot \mathbb{D}$

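The construction below works when $\mathcal{A} \subset \mathcal{G} \times \mathcal{M}(\mathcal{X})$ for some Polish subset $\mathcal{G} \subset \mathcal{F}(\mathcal{X})$
and $\mathcal{X}$ is Polish. Observe that since $\mathcal{D}$ is metrizable, it follows from [3, Thm. 15.13] that,
for any $B \in \mathcal{B}(\mathcal{D})$, the evaluation $\nu \mapsto \nu(B)$, $\nu \in \mathcal{M}(\mathcal{D})$, is measurable. Consequently,
the measurability of $\mathbb{D}$ implies that the mapping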

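$$\hat{\mathbb{D}} : \mathcal{A} \times \mathcal{B}(\mathcal{D}) \to \mathbb{R}$$

defined by

$$\hat{\mathbb{D}}\big((f,\mu), B\big) := \mathbb{D}(f,\mu)[B], \quad \text{for } (f,\mu) \in \mathcal{A},\ B \in \mathcal{B}(\mathcal{D}),$$

is a transition function in the sense that, for fixed $(f,\mu) \in \mathcal{A}$, $\hat{\mathbb{D}}((f,\mu), \cdot)$ is a probability
measure, and, for fixed $B \in \mathcal{B}(\mathcal{D})$, $\hat{\mathbb{D}}(\cdot, B)$ is Borel measurable. Therefore, by [17,
Thm. 10.7.2], any $\pi \in \mathcal{M}(\mathcal{A})$ defines a probability measure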

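$$\pi \odot \mathbb{D} \in \mathcal{M}\big(\mathcal{B}(\mathcal{A}) \times \mathcal{B}(\mathcal{D})\big)$$

through

$$\pi \odot \mathbb{D}[A \times B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbf{1}_A(f,\mu)\, \mathbb{D}(f,\mu)[B]\big], \quad \text{for } A \in \mathcal{B}(\mathcal{A}),\ B \in \mathcal{B}(\mathcal{D}), \qquad (21)$$

where $\mathbf{1}_A$ is the indicator function of the set $A$:

$$\mathbf{1}_A(f,\mu) := \begin{cases} 1, & \text{if } (f,\mu) \in A, \\ 0, & \text{if } (f,\mu) \notin A. \end{cases}$$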


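It is easy to see that $\pi$ is the $\mathcal{A}$-marginal of $\pi \odot \mathbb{D}$. Moreover, when $\mathcal{X}$ is Polish,
[3, Thm. 15.15] implies that $\mathcal{M}(\mathcal{X})$ is Polish, and when $\mathcal{G}$ is Polish it follows that
$\mathcal{A} \subset \mathcal{G} \times \mathcal{M}(\mathcal{X})$ is second countable. Consequently, since $\mathcal{D}$ is Suslin and hence second
countable, it follows from [31, Prop. 4.1.7] that

$$\mathcal{B}(\mathcal{A} \times \mathcal{D}) = \mathcal{B}(\mathcal{A}) \times \mathcal{B}(\mathcal{D})$$

and hence $\pi \odot \mathbb{D}$ is a probability measure on $\mathcal{A} \times \mathcal{D}$. That is,

$$\pi \odot \mathbb{D} \in \mathcal{M}(\mathcal{A} \times \mathcal{D}).$$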


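Henceforth denote by $\pi \cdot \mathbb{D}$ the corresponding Bayes’ sampling distribution defined by
the $\mathcal{D}$-marginal of $\pi \odot \mathbb{D}$, and note that, by (21), one has

$$\pi \cdot \mathbb{D}[B] := \mathbb{E}_{(f,\mu) \sim \pi}\big[\mathbb{D}(f,\mu)[B]\big], \quad \text{for } B \in \mathcal{B}(\mathcal{D}).$$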

Since both $\mathcal{D}$ and $\mathcal{A}$ are Suslin it follows that the product $\mathcal{A} \times \mathcal{D}$ is Suslin. Consequently,
[17, Cor. 10.4.6] asserts that regular conditional probabilities exist for any
sub-$\sigma$-algebra of $\mathcal{B}(\mathcal{A} \times \mathcal{D})$. In particular, the product theorem of [17, Thm. 10.4.11]
asserts that product regular conditional probabilities

$$(\pi \odot \mathbb{D})\,|\,d \in \mathcal{M}(\mathcal{A}), \quad \text{for } d \in \mathcal{D},$$

exist and that they are $\pi \cdot \mathbb{D}$-a.e. unique.
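In a finite setting the objects constructed above reduce to elementary sums, which may help fix ideas. The following is a minimal sketch under assumptions that are not part of the paper: the admissible set is replaced by three hypothetical models, the data space by two hypothetical data values, and the names `prior`, `kernel`, `joint`, `data_marginal` and `posterior` are illustrative only.

```python
# Hypothetical finite example: three candidate models and two possible data values.
prior = {"m1": 0.5, "m2": 0.3, "m3": 0.2}   # plays the role of pi on the admissible set
kernel = {                                  # kernel[m][d]: probability of data d under model m
    "m1": {"d0": 0.9, "d1": 0.1},
    "m2": {"d0": 0.5, "d1": 0.5},
    "m3": {"d0": 0.2, "d1": 0.8},
}


def joint(prior, kernel):
    """Joint measure on (model, data) pairs, the finite analogue of (21)."""
    return {(m, d): p * q for m, p in prior.items() for d, q in kernel[m].items()}


def data_marginal(joint_measure):
    """Data marginal of the joint measure (the Bayes sampling distribution)."""
    out = {}
    for (m, d), p in joint_measure.items():
        out[d] = out.get(d, 0.0) + p
    return out


def posterior(joint_measure, d):
    """Conditional measure on models given data d; needs positive marginal mass at d."""
    z = sum(p for (m, dd), p in joint_measure.items() if dd == d)
    if z == 0.0:
        raise ValueError("cannot condition on a data value of zero marginal probability")
    return {m: p / z for (m, dd), p in joint_measure.items() if dd == d}
```

The `ValueError` branch marks the only place where the finite case can fail. In the continuous setting every single data value typically has zero marginal probability, which is why the construction above relies on regular conditional probabilities that are unique only almost everywhere.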

7.2 Proof of Theorem 4

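If $\pi^\dagger \cdot \mathbb{D}$ is not absolutely continuous with respect to $\pi \cdot \mathbb{D}$ then there exists $B \in \mathcal{B}(\mathcal{D})$
such that $(\pi \cdot \mathbb{D})[B] = 0$ and $(\pi^\dagger \cdot \mathbb{D})[B] > 0$. Let $\theta \in \Theta(\pi)$. Define

$$\theta_y(d) := \theta(d)\,\mathbf{1}_{B^c}(d) + y\,\mathbf{1}_B(d) \qquad (22)$$

Then it is easy to see that if $y$ is in the range of $\Phi$ then $\theta_y \in \Theta(\pi)$. Now observe that
for $y, z \in \mathrm{Image}(\Phi)$,

$$\mathcal{E}(\theta_y, \pi^\dagger) - \mathcal{E}(\theta_z, \pi^\dagger) = \mathbb{E}_{(f,\mu,d) \sim \pi^\dagger \odot \mathbb{D}}\Big[\mathbf{1}_B(d)\big(V(y - \Phi(f,\mu)) - V(z - \Phi(f,\mu))\big)\Big].$$

Hence, for $V(x) = x^2$ it holds true that

$$\mathcal{E}(\theta_y, \pi^\dagger) - \mathcal{E}(\theta_z, \pi^\dagger) = \big[(y - \gamma)^2 - (z - \gamma)^2\big]\,(\pi^\dagger \cdot \mathbb{D})[B]$$

with

$$\gamma := \mathbb{E}_{\pi^\dagger \odot \mathbb{D}}[\Phi \mid D \in B],$$

which proves

$$\sup_{\theta_2 \in \Theta(\pi)} \mathcal{E}(\theta_2, \pi^\dagger) - \inf_{\theta_1 \in \Theta(\pi)} \mathcal{E}(\theta_1, \pi^\dagger) \geq \sup_{\substack{B \in \mathcal{B}(\mathcal{D}):\ (\pi \cdot \mathbb{D})[B] = 0, \\ y, z \in \mathrm{Image}(\Phi)}} \Big[\big(y - \mathbb{E}_{\pi^\dagger \odot \mathbb{D}}[\Phi \mid D \in B]\big)^2 - \big(z - \mathbb{E}_{\pi^\dagger \odot \mathbb{D}}[\Phi \mid D \in B]\big)^2\Big]\,(\pi^\dagger \cdot \mathbb{D})[B],$$

and

$$\sup_{\theta_2 \in \Theta(\pi)} \mathcal{E}(\theta_2, \pi^\dagger) - \inf_{\theta_1 \in \Theta(\pi)} \mathcal{E}(\theta_1, \pi^\dagger) \leq \big(\mathcal{U}(\mathcal{A}) - \mathcal{L}(\mathcal{A})\big)^2 \sup_{B \in \mathcal{B}(\mathcal{D}):\ (\pi \cdot \mathbb{D})[B] = 0} (\pi^\dagger \cdot \mathbb{D})[B].$$

To obtain the right hand side of (19) observe that (see for instance [28, Sec. 5]) there
exists $B^* \in \mathcal{B}(\mathcal{D})$ such that

$$(\pi^\dagger \cdot \mathbb{D})[B^*] = \sup_{B \in \mathcal{B}(\mathcal{D}):\ (\pi \cdot \mathbb{D})[B] = 0} (\pi^\dagger \cdot \mathbb{D})[B]$$

and (since $\theta_2 = \theta_1$ on the complement of $B^*$)

$$\sup_{\theta_1, \theta_2 \in \Theta(\pi)} \big(\mathcal{E}(\theta_2, \pi^\dagger) - \mathcal{E}(\theta_1, \pi^\dagger)\big) = \sup_{\theta_1, \theta_2 \in \Theta(\pi)} \mathbb{E}_{(f,\mu,d) \sim \pi^\dagger \odot \mathbb{D}}\Big[\mathbf{1}_{B^*}(d)\big(V(\theta_2(d) - \Phi(f,\mu)) - V(\theta_1(d) - \Phi(f,\mu))\big)\Big].$$

We conclude by observing that for $V(x) = x^2$,

$$\sup_{\theta_1, \theta_2 \in \Theta(\pi)} \big(V(\theta_2(d) - \Phi(f,\mu)) - V(\theta_1(d) - \Phi(f,\mu))\big) \leq \big(\mathcal{U}(\mathcal{A}) - \mathcal{L}(\mathcal{A})\big)^2.$$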
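The quadratic identity used in the proof (the difference of risks equals the difference of squared distances to the conditional mean, times the mass of B) does not depend on the measure-theoretic setting and can be checked numerically. The sketch below does so in a hypothetical finite model: the three models, their values `phi`, the prior `prior_true` (standing in for the true prior), the kernel, the base estimator `theta` and the set `B` are all invented for illustration.

```python
# Hypothetical finite setting: three models m with quantity of interest phi[m].
phi = {"m1": 0.0, "m2": 1.0, "m3": 2.0}
prior_true = {"m1": 0.2, "m2": 0.5, "m3": 0.3}   # stands in for the true prior
kernel = {                                       # kernel[m][d]: probability of data d under model m
    "m1": {"d0": 0.7, "d1": 0.2, "d2": 0.1},
    "m2": {"d0": 0.3, "d1": 0.4, "d2": 0.3},
    "m3": {"d0": 0.1, "d1": 0.3, "d2": 0.6},
}
theta = {"d0": 0.4, "d1": 1.1, "d2": 1.7}        # an arbitrary base estimator
B = {"d1", "d2"}                                 # data values on which theta is overwritten


def risk(estimator):
    """Quadratic risk E[(estimator(d) - phi)^2] under prior_true and kernel."""
    return sum(p * q * (estimator[d] - phi[m]) ** 2
               for m, p in prior_true.items() for d, q in kernel[m].items())


def overwrite(y):
    """Estimator of the form (22): equal to y on B and to theta elsewhere."""
    return {d: (y if d in B else t) for d, t in theta.items()}


def mass_and_gamma():
    """Return the data-marginal mass of B and gamma = E[phi | D in B]."""
    mass = sum(p * q for m, p in prior_true.items()
               for d, q in kernel[m].items() if d in B)
    weighted = sum(p * q * phi[m] for m, p in prior_true.items()
                   for d, q in kernel[m].items() if d in B)
    return mass, weighted / mass


def identity_gap(y, z):
    """Left-hand side minus right-hand side of the quadratic identity."""
    mass, gamma = mass_and_gamma()
    lhs = risk(overwrite(y)) - risk(overwrite(z))
    rhs = ((y - gamma) ** 2 - (z - gamma) ** 2) * mass
    return lhs - rhs
```

Calling `identity_gap(0.0, 2.0)` should return a value that is zero up to floating-point rounding. The check only exercises the algebraic step; the set B in the proof additionally has zero mass under the sampling distribution of the assumed prior, which this finite illustration does not attempt to model.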

7.3 Conditional expectation as an orthogonal projection

It easily follows from Tonelli’s Theorem that 


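By considering the sub-$\sigma$-algebra $\mathcal{A} \times \mathcal{B}(\mathcal{D}) \subset \mathcal{B}(\mathcal{A} \times \mathcal{D}) = \mathcal{B}(\mathcal{A}) \times \mathcal{B}(\mathcal{D})$ it follows from
e.g. Theorem 10.2.9 of [31] that $L^2_{\pi \cdot \mathbb{D}}(\mathcal{D})$ is a closed Hilbert subspace of the Hilbert space
$L^2_{\pi \odot \mathbb{D}}(\mathcal{A} \times \mathcal{D})$ and the conditional expectation of $\Phi$ given the random variable $D$ is the
orthogonal projection from $L^2_{\pi \odot \mathbb{D}}(\mathcal{A} \times \mathcal{D})$ to $L^2_{\pi \cdot \mathbb{D}}(\mathcal{D})$.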
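The projection property can be checked numerically in a finite setting, where square-integrable functions of the data are simply vectors indexed by data values. The sketch below is an illustration under invented assumptions (three hypothetical models, three hypothetical data values, illustrative function names). It verifies that the residual of the conditional expectation is orthogonal to an arbitrary function of the data, and that the conditional expectation has no larger squared error than that function.

```python
# Hypothetical finite setting, used only to illustrate the projection property.
phi = {"m1": 0.0, "m2": 1.0, "m3": 2.0}      # quantity of interest per model
prior = {"m1": 0.2, "m2": 0.5, "m3": 0.3}
kernel = {                                   # kernel[m][d]: probability of data d under model m
    "m1": {"d0": 0.7, "d1": 0.2, "d2": 0.1},
    "m2": {"d0": 0.3, "d1": 0.4, "d2": 0.3},
    "m3": {"d0": 0.1, "d1": 0.3, "d2": 0.6},
}


def conditional_expectation():
    """Return g with g[d] = E[phi | D = d] under the joint measure."""
    num, den = {}, {}
    for m, p in prior.items():
        for d, q in kernel[m].items():
            num[d] = num.get(d, 0.0) + p * q * phi[m]
            den[d] = den.get(d, 0.0) + p * q
    return {d: num[d] / den[d] for d in den if den[d] > 0.0}


def residual_inner(h):
    """Inner product <phi - g, h> in L^2 of the joint measure, h a function of d."""
    g = conditional_expectation()
    return sum(p * q * (phi[m] - g[d]) * h[d]
               for m, p in prior.items() for d, q in kernel[m].items())


def squared_error(h):
    """Squared L^2 distance between phi and a function h of the data."""
    return sum(p * q * (phi[m] - h[d]) ** 2
               for m, p in prior.items() for d, q in kernel[m].items())
```

For any choice of `h`, `residual_inner(h)` should vanish up to rounding and `squared_error(conditional_expectation())` should not exceed `squared_error(h)`, which is the finite-dimensional content of the orthogonal projection statement.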